Foreword



For many years the intersection of computing and data analysis contained menu- 
based statistics packages and not much else. Recently, statisticians have em- 
braced computing, computer scientists have started using statistical theories 
and methods, and researchers in all corners have invented algorithms to find 
structure in vast online datasets. Data analysts now have access to tools for 
exploratory data analysis, decision tree induction, causal induction, function es- 
timation, constructing customized reference distributions, and visualization, and 
there are intelligent assistants to advise on matters of design and analysis. There 
are tools for traditional, relatively small samples, and also for enormous datasets. 
In all, the scope for probing data in new and penetrating ways has never been 
so exciting. 

The IDA-99 conference brings together a wide variety of researchers con- 
cerned with extracting knowledge from data, including people from statistics, 
machine learning, neural networks, computer science, pattern recognition, data- 
base management, and other areas. The strategies adopted by people from these 
areas are often different, and a synergy results if this is recognized. The IDA 
series of conferences is intended to stimulate interaction between these different 
areas, so that more powerful tools emerge for extracting knowledge from data and 
a better understanding is developed of the process of intelligent data analysis. 
The result is a conference that has a clear focus (one application area: intelligent 
data analysis) and a broad scope (many different methods and techniques). 

IDA-99 took place in Amsterdam from 9-11 August 1999. The invited speak- 
ers were Jacqueline Meulman (Leiden University, The Netherlands), Zdzislaw 
Pawlak (Warsaw Institute of Technology, Poland), and Paul Cohen (Univer- 
sity of Massachusetts, Amherst, United States). The conference received more 
than 100 submissions. During a meeting of the program committee organized 
by Xiaohui Liu at Birkbeck College in London, 21 papers were selected for oral 
presentation and 23 papers were selected for poster presentation. 

We want to express our thanks to all the people involved in the organiza- 
tion of IDA-99: especially Paul Cohen for the initial discussions, Xiaohui Liu 
and Michael Berthold for all their work behind the scenes, once again Michael 
Berthold for the preparation of the proceedings, and Daniel Tauritz, Walter 
Kosters, Marloes Boon-van der Nat, Frans Snijders, Mieke Brune, and Arno 
Siebes for the local organization. 
Abstract. The main focus of theoretical models for machine learning is
to describe formally what learnable means, what a learning process is, and
what the relationship is between a learning agent and a teaching one.
However, when we prove from a theoretical point of view that a concept is
learnable, we have no a priori idea of how difficult the target concept is
to learn. In this paper, after recalling some theoretical concepts and the
main estimation methods, we provide a learning-system-independent measure
of the difficulty of learning a concept. It is based on geometrical and
statistical concepts, and on the implicit assumption that distinct classes
occupy distinct regions in the feature space. In such a context, we assume
that learnability is identified by the level of separability in the feature
space. Our definition is constructive, based on a statistical test, and has
been implemented on problems from the UCI repository. The results are
convincing and fit well with theoretical results and intuition. Finally, in
order to reduce the computational cost of our approach, we propose a new way
to characterize the geometrical regions using a k-Nearest-Neighbors graph.
We show experimentally that it yields accuracy estimates close to those
obtained by leave-one-out cross-validation, with a smaller standard
deviation.



1 Introduction 

As in many fields of computer science, we can distinguish two trends in the
machine learning community: the theoretical one and the practical one. In the
first, people are interested in developing formal models that describe as
closely as possible what a learning process is. An early exact learning model
was restricted to primitive recursive functions and had no time or
complexity requirements. It was a kind of idealized model (learning in the
limit) capturing only a small part of human learning and much too far from
practical requirements. The paper of Valiant was an attempt to relax such an
idealized model and to define a theory of the learnable in which some
algorithmic and statistical constraints are introduced. Today this theory,
based on PAC learnability, is often the starting point of new theoretical
research.
On the other hand, because of theoretical constraints, we find researchers
who develop tools and algorithms applicable in practical situations. Their aim
is to develop efficient machines able to learn more or less complicated
concepts correctly. The relevance of these machines is often estimated
numerically, taking into account both algorithmic and statistical
constraints, as presented in the PAC model. Statistical criteria or accuracy
estimates are then often used to measure a certain degree of learnability.

Given the different approaches of these two trends, one may well wonder
whether a concept that is learnable from a practical standpoint is still
learnable from a theoretical point of view, and vice versa. In this article,
starting from theoretical considerations about learnability, we
progressively show how they lead to the practical estimation methods used in
machine learning. We then propose a first statistical approach to measure
learnability regardless of the learning method. Actually, we think that the
learnability degree of a given concept is an intrinsic property of its
representation in the feature space. The idea of assessing the relevance of
the feature space independently of a learning method is not new: it is the
original concept of the filter models in the feature selection field. Our
approach is based on the construction of the well-known Minimum Spanning
Tree (MST), from which we characterize homogeneous subsets by deleting some
particular edges. We build a statistical test on the number of deleted edges
and use the critical threshold of the test to assess a learnability degree.
In order to show the interest of our approach, we present some experimental
results on benchmarks of the UCI repository, comparing our a priori decision
rule with a posteriori accuracy estimates computed by three induction
methods: C4.5, ID3 and the k-Nearest Neighbors (k-NN).

The main drawback of the MST is its computational cost. In order to deal
with the very large databases produced by new data-acquisition technologies
(the World Wide Web for instance), we propose in the last section to replace
the MST by the k-NN graph, which has a lower complexity. This slightly
modifies the property of learning-method independence, without challenging
its principle. Actually, even if we use the k-NN to characterize the
homogeneous subsets, we do not appeal to any induction rule for classifying a
new instance. Our experiments suggest that the statistical variable built
from this new graph not only gives a good accuracy estimate, but also has a
smaller standard deviation.

2 Theoretical Considerations 

2.1 PAC Learnability

The task assigned to a learning machine is to approximate a target concept
$f$, and such a machine can be decomposed into two parts: a learning
protocol and a deduction procedure. The learning protocol describes how the
information available for learning $f$ is represented and how we access this
information. The machine has to recognize whether $f$ is true or not for a
given datum belonging to a set $\Omega$. One of the natural procedures to
access information is called EXAMPLE and provides a finite set of examples
$\Omega_a \subset \Omega$, the training set, positively or negatively
exemplifying the target concept. $\Omega$ is supposed to be given with an
unknown but fixed probability distribution $P$. The deduction process is in
fact an algorithm $A$ that invokes the protocol and has to produce an
approximation $h$ of the target concept $f$.

Definition 1. The target concept $f$ is considered learnable if and only if
there exists an algorithm $A$ producing a concept $h$ in a finite number of
steps and satisfying the following property, usually denoted PAC (Probably
Approximately Correct): for each $\epsilon, \delta \in [0, 1]$, with
probability at least $1 - \delta$, we must have

\[ \sum_{\{\omega \in \Omega \,\mid\, f(\omega) \neq h(\omega)\}} P(\omega) \le \epsilon \]

So PAC learnability is simply stated in terms of the existence of an
algorithm satisfying the PAC property. The PAC property has since been
refined, and some more sophisticated learning protocols have been studied,
but the basic intuition has not really changed. We may point out some
drawbacks of such models. Since the distribution $P$ is unknown, there is no
way to compute the theoretical risk of error (the left-hand term of the
inequality). Furthermore, PAC learnability does not provide training
algorithms but rather gives formal criteria to determine whether a given
algorithm runs in a satisfactory way. Even when the result is positive, some
concepts are "easily" learnable whereas others give rise to many
difficulties, and PAC learnability does not differentiate such situations.
In some sense, the work of Vapnik overcomes some of these drawbacks by
giving an upper bound on the error probability and providing a general
scheme of training algorithms, namely the Support Vector Machines.



2.2 Support Vector Machines 

Support Vector Machines (SVM) were introduced by Vapnik and are
theoretically founded on the Vapnik-Chervonenkis (VC) dimension theory. Let
us recall what the VC dimension is. Given a set $\Omega$, we consider a
finite subset (i.e. a concept) $A$ of $\Omega$ and a family $\mathcal{F}$ of
subsets of $\Omega$. We denote
$A \cap \mathcal{F} = \{A \cap F \mid F \in \mathcal{F}\}$; this set is often
called the trace of $\mathcal{F}$ over $A$. Of course,
$A \cap \mathcal{F} \subseteq 2^A$, where $2^A$ is the power set of $A$. So
$A$ is shattered by $\mathcal{F}$ iff $A \cap \mathcal{F} = 2^A$. If $A$ is
shattered by $\mathcal{F}$, this means that each partition of $A$ can be
described by $\mathcal{F}$. The VC dimension $h$ of $\mathcal{F}$ is just the
least upper bound of the set
$\{|A| : A \text{ is shattered by } \mathcal{F}\}$. So the VC dimension is an
integer (sometimes infinite) which measures, in some sense, the complexity
of the given model $\mathcal{F}$. If the elements $F$ of $\mathcal{F}$ are
described by functions, we see that the VC dimension $h$ is linked to a
family of functions used to discriminate input data, so the output of the
training algorithm is supposed to be one of these functions (linear
functions, polynomials, potential functions, etc.). One of the main
advantages of this approach is the fact that the general risk is bounded by
the empirical risk, which is computable, plus another quantity $Q$ depending
on $h$. In the previous notation of PAC, we have:
\[ \sum_{\{\omega \in \Omega \,\mid\, f(\omega) \neq h(\omega)\}} P(\omega) \;<\; R_{emp}(h) + Q \]

where $R_{emp}(h)$ denotes the empirical risk, measured on the misclassified
training examples $\{\omega \in \Omega_a \mid f(\omega) \neq h(\omega)\}$,
and $Q$ is a quantity linked to $m$, the cardinality of the training set,
and to $h$, the VC dimension of the given model. $Q$ is a decreasing
function of $m/h$. So, if $m/h$ is very large, it is sufficient to reduce the
empirical risk to control the real risk. Ideally, the best situation is when
$m$ is very large, but generally this is not the case, so it is necessary to
control the VC dimension $h$. Starting from such considerations, Vapnik
provides a training scheme called Support Vector Machines. In the simple
case of a two-class problem, the main idea is to discriminate two point sets
by building a separating hyperplane. This hyperplane is chosen so as to
maximize its distance (called the margin) to the closest data. This choice
fits not only with theoretical results but also with intuition: such a
hyperplane is more robust, since a new point added close to an element of a
given class will be far from the hyperplane and so will be (with high
probability) correctly classified. However, a main drawback remains: a model
(i.e. a set of functions) is fixed a priori to tackle a problem, and we are
not sure that a function of this model will be a good separator.
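The shattering condition recalled at the start of this section can be checked by brute force on a small finite set. The following Python sketch is only an illustration: the function names and the toy family of intervals over {1, 2, 3, 4} are assumptions introduced here, not part of the original paper.

```python
from itertools import combinations


def trace(A, family):
    """Trace of `family` over A: the set {A ∩ F | F in family}."""
    A = frozenset(A)
    return {A & frozenset(F) for F in family}


def is_shattered(A, family):
    """A is shattered iff its trace contains all 2^|A| subsets of A."""
    return len(trace(A, family)) == 2 ** len(set(A))


def vc_dimension(omega, family):
    """Largest size of a subset of the finite set `omega` shattered by `family`."""
    omega = list(omega)
    best = 0
    for size in range(1, len(omega) + 1):
        if any(is_shattered(A, family) for A in combinations(omega, size)):
            best = size
        else:
            # Every subset of a shattered set is shattered, so we can stop here.
            break
    return best


# Hypothetical example family: the empty set plus all intervals {lo, ..., hi}.
OMEGA = [1, 2, 3, 4]
INTERVALS = [set()] + [
    set(range(lo, hi + 1)) for lo in OMEGA for hi in OMEGA if lo <= hi
]

if __name__ == "__main__":
    print(is_shattered({1, 2}, INTERVALS))
    print(is_shattered({1, 2, 3}, INTERVALS))
    print(vc_dimension(OMEGA, INTERVALS))
```

With this toy family, no interval can contain 1 and 3 without containing 2, so no three-point set is shattered while two-point sets are.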

3 Practical Considerations 

3.1 Notations 

Let us consider a set $\Omega$ of instances, each instance $\omega$ being
characterized by the values of $p$ features. We associate with $\omega$ a
$p$-vector $X(\omega) = (x_1(\omega), \ldots, x_p(\omega)) \in X = X_1 \times
\ldots \times X_p$, each $x_i(\omega)$ belonging to the set $X_i$. $X$ is the
representation space, or feature set, of $\Omega$. $\Omega$ is supposed to be
partitioned into a finite number of classes $c_1, \ldots, c_m$. For only a
finite number $n$ of elements $\omega$ in $\Omega$ do we know the
corresponding class $c(\omega)$: they constitute a finite subset
$\Omega_a \subset \Omega$, often called the training set
($n = |\Omega_a|$). The aim of the learning algorithm is to generate a
mechanism that computes the class of a new element $\omega$ which does not
belong to $\Omega_a$. Of course, the underlying hypothesis is that
$c(\omega)$ depends only on the vector $X(\omega)$.

3.2 Accuracy Estimates 

Instead of proving that a given concept is learnable or not, it is possible
to estimate a learnability degree of the target concept with regard to a
given learning algorithm by computing an a posteriori accuracy. Different a
posteriori methods are available: (i) the holdout method, (ii) the k-fold
cross-validation method, (iii) the bootstrap procedure. However, these a
posteriori methods not only may require high computational costs, but also
depend on the learning method used to compute the learnability degree, which
is not our objective.

Moreover, recent work has shown that accuracy is not always a suitable
criterion. A formal proof has been given that explains why the Gini
criterion and the entropy should be optimized instead of the accuracy when a
top-down induction algorithm is used to grow a decision tree. The same kind
of conclusion has been reached with another statistical criterion.

3.3 Statistical Criteria 

Another way to measure learnability is to use a priori statistical
criteria, regardless of the learning algorithm. This approach is based on the
intuition that the difficulty of learning a concept depends directly on the
feature space, and not on the learning method applied. Actually, the same
concept may be more learnable in one feature space than in another.
Measuring the learnability of a concept can then amount to estimating the
relevance of the feature space, a notion for which several definitions have
been surveyed in the literature. We recall here different criteria used to
assess relevance.

- Interclass distance: the average distance between instances belonging to
different classes is a good criterion to measure the relevance of a given
feature space. However, the use of this criterion is restricted to problems
without mutual class overlaps.

- Class projection: this approach estimates the relevance of a given feature
using conditional probabilities.

- Entropy: one can speak about feature relevance in terms of information
theory. One can then use Shannon's mutual information; the cross-entropy
measure has also been used, as has a quadratic entropy computed from a
neighborhood graph.

- Probabilistic distance: in order to treat class overlaps correctly, a
better approach consists in measuring distances between probability density
functions. It often leads to the construction of homogeneity tests.

The homogeneity tests are very interesting because they provide rigorous
tools that give a good idea of the difficulty of learning a concept.
Unfortunately, none of the tests currently proposed in the literature is both
nonparametric and applicable in $X$ with any type of attributes (nominal,
continuous, discrete). We present in the next section a new statistical test
overcoming these constraints.

4 The Test of Edges 

4.1 Introduction 

In the context of probabilistic distances, we can highlight two types of
extreme situations:

1. The most difficult situations are those where classes have the same
probability distribution, i.e.:

$f_1(x) = f_2(x) = \ldots = f_k(x)$ (called the null hypothesis in our test)

where $f_i(x)$ is the probability for a given vector $x \in X$ to be in
$c_i$. During the learning phase, instances belonging to these classes will
just be learned by heart (overfitting situation).

2. On the contrary, the most comfortable situations are those where classes
have totally separated probability distributions.

The goal of our test is then to measure how far we are from the worst
situation (i.e. where the distributions are identical). The notion of
distance underlies our definition. First introduced in the pattern
recognition field, the original version of the test of edges was limited to
continuous attributes, i.e. to $X = \mathbb{R}^p$. Recent work on specific
distance functions for nominal attributes allows us to generalize the test
here to mixed spaces. We detail in this section the useful aspects of the
test of edges.

4.2 Formalism 

The main idea is to decompose the space X into disjoint subsets, each of these 
subsets representing elements of the same class: so these subsets are considered 
as homogeneous. To do that, we use information contained in the MST. 

Definition 2. A tree is a connected graph without cycles. A subgraph that spans
all vertices of a graph is called a spanning subgraph. A subgraph that is a tree 
and that spans all vertices of the original graph is called a spanning tree. 

Definition 3. Among all the spanning trees of a weighted and connected graph, 
the one with the least total weight is called a Minimum Spanning Tree (MST). 

So, if we have a distance d over X, we can easily build a MST considering the 
weight of an edge as the distance between its extremities. Our approach is based 
on the search in the MST for homogeneous subsets. 

Definition 4. Let $G = (V, E)$ be a graph composed of $v = |V|$ vertices and
$e = |E|$ edges, and let $G' = (V', E')$ be a subgraph of $G$. $G'$ is a
homogeneous subset if and only if:

1. All points of $G'$ belong to the same class. In our case, where
$V \subset X$, this means:
$\forall X(\omega_i), X(\omega_j) \in V'$, $c(\omega_i) = c(\omega_j)$.

2. $G'$ is connected.

Starting from the training set $\Omega_a$ and given a distance $d$ over $X$,
we get the homogeneous subsets with the following natural procedure:

- step 1: Construction of the MST over $\Omega_a$, which is composed of
$n - 1$ edges.

- step 2: Deletion of the edges which connect 2 points belonging to
different classes.

We may notice that each deletion in step 2 means that two elements whose
representations are very similar in fact belong to distinct classes: so the
number $D$ of deleted edges is a main factor in estimating a learnability
degree. In fact, $D$ has to be assessed with regard to the initial number of
edges $n - 1$: so our random variable is $\frac{D}{n-1}$ (the proportion of
edges deleted to obtain homogeneous subsets). In the case where the class
distributions are equal, we will have a high probability $p$ of removing an
edge. So the degree of learnability may be estimated by comparing $p$ and
$\frac{D}{n-1}$.
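The two-step procedure of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the points are numeric tuples compared with the Euclidean distance, the MST is built with Prim's algorithm (the text does not prescribe a particular MST algorithm), and the function names are hypothetical.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns n - 1 edges (i, j)."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known connection of each vertex to the tree
    parent = [-1] * n
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Step of Prim's algorithm: add the closest vertex not yet in the tree.
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best_dist[v])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    parent[v] = u
    return edges


def deleted_edge_proportion(points, labels):
    """Return (D, D / (n - 1)): MST edges joining two different classes."""
    edges = minimum_spanning_tree(points)                           # step 1
    deleted = sum(1 for i, j in edges if labels[i] != labels[j])    # step 2
    return deleted, deleted / len(edges)


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
    print(deleted_edge_proportion(pts, ["a", "a", "a", "b", "b", "b"]))
    print(deleted_edge_proportion(pts, ["a", "b", "a", "b", "a", "b"]))
```

In the first call the two classes form two well-separated groups, so only the single MST edge bridging the groups is deleted; in the second call the labels alternate along the chain and every edge is deleted.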
4.3 The Law of $\frac{D}{n-1}$

With $n$ instances we can build $\frac{n(n-1)}{2}$ edges. Among these edges,
only $n - 1$ are actually built in the MST. Considering the deletion of an
edge of the MST as a success (with success probability $p$), $D$ corresponds
to the number of successes. We can then deduce that the law of $D$ is
hypergeometric. If $n$ is large enough, the asymptotic normality of $D$
holds.
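For reference, the normal approximation can be written out using the standard moments of a hypergeometric variable. The symbols $N$ and $Z$ are introduced here only for illustration; the paper itself does not spell out this standardization.

```latex
% Population of N = n(n-1)/2 possible edges, n-1 draws, success probability p.
\[
  N = \frac{n(n-1)}{2}, \qquad
  E[D] = (n-1)\,p, \qquad
  \mathrm{Var}[D] = (n-1)\,p\,(1-p)\,\frac{N-(n-1)}{N-1},
\]
\[
  Z = \frac{D - (n-1)\,p}{\sqrt{\mathrm{Var}[D]}} \;\approx\; \mathcal{N}(0,1)
  \quad \text{for large } n .
\]
```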



4.4 The Value of p 



Recall that $p$ is the probability, under the null hypothesis, of deleting an
edge. For an edge to be deleted, 2 conditions must hold: (i) the 2 points
$(\omega_i, \omega_j)$ linked by this edge must be neighbors in the MST; (ii)
they do not belong to the same class. The probability of deleting an edge is
$p = P[(N_{ij} = true) \cap (c(\omega_i) \neq c(\omega_j))]$, where
$(N_{ij} = true)$ means that $\omega_i$ is a neighbor of $\omega_j$. Under
the null hypothesis, the events "to be neighbors" and "to belong to the same
class" are independent. We deduce that
$p = P(N_{ij} = true) \times P(c(\omega_i) \neq c(\omega_j))$.

We can easily find that

- $P(N_{ij} = true) = \frac{n-1}{\frac{n(n-1)}{2}}$

- $P(c(\omega_i) \neq c(\omega_j)) = \frac{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} n_i n_j}{\frac{n(n-1)}{2}}$ for $k \geq 2$

Thus, the probability $p$ of deleting an edge is:

\[ p = \frac{n-1}{\frac{n(n-1)}{2}} \times \frac{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} n_i n_j}{\frac{n(n-1)}{2}} \]

where $n_i$ is the number of instances of class $i$. In conclusion,

\[ D \equiv H\left( \frac{n(n-1)}{2},\; n-1,\; \frac{n-1}{\frac{n(n-1)}{2}} \times \frac{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} n_i n_j}{\frac{n(n-1)}{2}} \right) \]
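The two factors of $p$ depend only on the class counts $n_i$ and on $n$, so they are easy to evaluate. The sketch below transcribes the formulas of this section directly; the function names are hypothetical and the class labels in the example are made up.

```python
from collections import Counter
from itertools import combinations


def prob_different_classes(labels):
    """P(c(w_i) != c(w_j)): cross-class pairs over all n(n-1)/2 pairs."""
    n = len(labels)
    counts = Counter(labels).values()                 # the class sizes n_i
    cross_pairs = sum(a * b for a, b in combinations(counts, 2))
    return cross_pairs / (n * (n - 1) / 2)


def prob_mst_neighbors(n):
    """P(N_ij = true): n - 1 MST edges among the n(n-1)/2 possible edges."""
    return (n - 1) / (n * (n - 1) / 2)


def edge_deletion_probability(labels):
    """p as the product of the two factors, as stated in the text."""
    return prob_mst_neighbors(len(labels)) * prob_different_classes(labels)


if __name__ == "__main__":
    labels = ["a", "a", "a", "b", "b"]   # n = 5, n_a = 3, n_b = 2
    print(prob_different_classes(labels))
    print(prob_mst_neighbors(len(labels)))
    print(edge_deletion_probability(labels))
```

For this toy sample there are 10 possible pairs, of which 3 x 2 = 6 join different classes, and 4 of the 10 possible edges belong to the MST.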



We can use the critical threshold $\alpha_c$ of this test to measure the
degree of learnability between the 2 extreme situations: unlearnability
($\alpha_c = 1$, i.e. $f_1(x) = f_2(x)$) and strong learnability
($\alpha_c = 0$, i.e. $f_1 \cap f_2 = \emptyset$). Thus, we define the
$(1 - \alpha_c)$-learnability as follows:

Definition 5. A concept is $(1 - \alpha_c)$-learnable iff the test of edges
provides a critical threshold $\alpha_c$ satisfying
$P[f_1(x) = \ldots = f_k(x)] = \alpha_c$.



4.5 Distance Functions 

The construction of the MST requires a distance over the feature set X. When 
features have real values, standard Euclidean metric is sufficient. A lot of recent 




10 Marc Sebban and Gilles Richard 



works have been devoted to defining adequate distances for spaces with mixed
features. New heterogeneous distance functions have been proposed, called the
Heterogeneous Euclidean-Overlap Metric (HEOM), the Heterogeneous Value
Difference Metric (HVDM), the Interpolated Value Difference Metric (IVDM),
and the Windowed Value Difference Metric (WVDM). These distance functions
properly handle nominal and continuous input attributes and allow the
construction of an MST in mixed spaces. They are inspired by the Value
Difference Metric (VDM), which defines the distance between two values $x$
and $y$ of a nominal attribute $a$ as:

\[ vdm_a(x, y) = \sum_{i=1}^{k} \left| \frac{N_{a,x,c_i}}{N_{a,x}} - \frac{N_{a,y,c_i}}{N_{a,y}} \right|^q = \sum_{i=1}^{k} \left| P_{a,x,c_i} - P_{a,y,c_i} \right|^q \]

where

- $N_{a,x}$ is the number of instances in $\Omega_a$ that have value $x$ for
attribute $a$,

- $N_{a,x,c_i}$ is the number of instances in $\Omega_a$ that have value $x$
for attribute $a$ and output class $c_i$,

- $P_{a,x,c_i}$ is the conditional probability that the output class is $c_i$
given that attribute $a$ has the value $x$, i.e. $P(c_i \mid x_a = x)$,

- $k$ is the number of output classes, and $q$ is a constant.

For continuous attributes, HEOM, HVDM, IVDM and WVDM use different
strategies (normalization, discretization). We have integrated these new
distance functions into our test in order to compare our approach on any
problem with any learning method.
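As an illustration of the Value Difference Metric for a single nominal attribute, the sketch below estimates the conditional probabilities from counts, following the definitions above. The function name, the default exponent q = 2 and the toy data are assumptions made for the example; both values are assumed to occur at least once in the training sample.

```python
from collections import Counter


def vdm(values, classes, x, y, q=2):
    """Value Difference Metric between values x and y of one nominal attribute.

    values[i] is the attribute value of training instance i and classes[i] is
    its class; both x and y must occur at least once in `values`.
    """
    count_value = Counter(values)                        # N_{a,x}
    count_value_class = Counter(zip(values, classes))    # N_{a,x,c}
    total = 0.0
    for c in set(classes):
        p_x = count_value_class[(x, c)] / count_value[x]   # P(c | a = x)
        p_y = count_value_class[(y, c)] / count_value[y]   # P(c | a = y)
        total += abs(p_x - p_y) ** q
    return total


if __name__ == "__main__":
    colour = ["red", "red", "blue", "blue", "blue"]
    target = ["+", "-", "+", "+", "+"]
    print(vdm(colour, target, "red", "blue"))
    print(vdm(colour, target, "red", "red"))
```

Two values are close under this metric when they induce similar class distributions, which is what lets nominal attributes contribute to the distances used to build the MST.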

5 Experimental Comparisons and Results on the MST 

In order to highlight the interest of our approach, we have to compare the
learnability degree determined a priori by our test with the ability of a
given learning algorithm to learn the problem (i.e. the a posteriori
accuracy). We worked on 10 databases extracted from the UCI Repository¹.
Moreover, we simulated an artificial problem (called Artificial) for which we
know the degree of overlap and the value of $\alpha_c$ (close to 1). For each
dataset, we applied the following experimental set-up:

1. we used the C4.5 learning algorithm to determine an accuracy estimate
by cross-validation.

2. we also applied the ID3 algorithm to determine a second accuracy rate.

3. a k-Nearest Neighbors classifier was run with k = 10.

4. finally, we computed the proportion of deleted edges from the MST,
$\frac{D}{n-1}$, and the associated critical threshold $\alpha_c$.



¹ http://www.ics.uci.edu/~mlearn/MLRepository.html
Table 1. Critical threshold $\alpha_c$, $1 - \frac{D}{n-1}$, and accuracy rates with 3 learning
algorithms

Database       feat.  n     C4.5  ID3   k-NN  $\alpha_c$   $1 - \frac{D}{n-1}$
Artificial     10     1000  51.0  50.7  50.1  1            50.1
Monks-2        6      432   65.0  69.7  62.0  0.95         63.5
Vehicle        18     846   69.9  71.8  69.5  10^(-?)      70.2
Pima           8      768   72.7  73.8  70.2  5.10^(-?)    71.0
Horse          22     368   84.8  75.3  71.1  10^(-?)      72.0
Monks-1        6      432   75.7  82.4  71.0  10^(-?)      72.3
Cleveland      7      297   72.3  72.4  75.0  9.10^(-?)    74.7
Australian     14     690   85.6  80.4  80.0  10^(-?)      80.2
Monks-3        6      432   97.1  90.3  90.0  10^(-?)      90.0
Vote           15     435   95.6  94.0  91    10^(-?)      91.2
Breast Cancer  10     699   95.7  94.6  96.3  10^(-?)      95.8

(? marks an exponent that is illegible in the source.)

A first way to present the results would be to sort the critical thresholds
$\alpha_c$ to obtain a ranking according to the learnability degree of each
problem, and to compare this order with the accuracies computed by the 3
learning algorithms. However, contrary to the critical threshold, the
accuracy estimates do not take into account the number of classes and the
number of instances of each class. An 80% accuracy rate on 2 classes cannot
be compared with an 80% accuracy rate on 10 classes. So we cannot directly
compare the accuracies with $\alpha_c$. To avoid normalizing the accuracies,
a better comparison consists in using directly the statistical variable
$1 - \frac{D}{n-1}$ instead of $\alpha_c$. Actually, this quantity
intuitively measures the guarantee for a new instance to be surrounded by
examples of the same class, i.e. to be correctly classified. Moreover, this
strategy does not challenge the interest of $\alpha_c$ for identifying the
unlearnable problems. Sorting the datasets according to the value of
$1 - \frac{D}{n-1}$, we obtain Table 1.

We can note that:

1. With a classical $\alpha = 5\%$ risk, we can conclude a priori that the
Artificial and Monks-2 problems are statistically unlearnable
($\alpha_c > \alpha$). So, our conclusion is that it seems useless to try to
learn these 2 problems with any learning algorithm. This conclusion is
confirmed a posteriori by the accuracy estimates computed with the 3
learning methods. (Note that the number of instances is different for each
class and that the random accuracy on Monks-2 is then about 55.9%.)

2. The order is strictly kept by the accuracies computed with a k-NN
classifier. This phenomenon is not surprising because the MST is a
particular neighborhood graph, not very far from the 1-NN topology.

3. What is more interesting is the order obtained with the two other
classifiers. Excluding one database (Monks-1), ID3 keeps the same order. For
C4.5, three permutations are necessary to keep the same ranking.
Table 2. Difference $d$ for different sample sizes $n$ and different values of $\mu_2$

        $d$, by value of $\mu_2$
n       3.0    2.0    1.5    1.2
50      0.6    1.0    1.9    2.3
100     0.7    1.0    1.2    2.2
150     0.5    0.9    1.0    1.9
200     0.2    0.9    0.9    1.2
300     0      0.2    0.6    0.9
400     0.3    0.1    0.5    0.7
500     0.1    0.3    0.5    0.7

So, in the majority of cases, we can claim that our approach gives an accurate
idea of the learnability degree of each studied problem. Actually, the
experimental results confirm a posteriori the decision of our statistical test.

6 The Test of Edges on a k-NN Graph 

While our approach provides good results, it has a shortcoming: its
complexity. Actually, the construction of the MST requires
$O(n^2 \log n)$. A solution is to replace the MST by another neighborhood
graph with lower complexity that keeps the notion of homogeneous subset. The
simplest is certainly the k-NN graph, which requires $O(n^2)$ with the worst
algorithm. Even if we use the k-Nearest Neighbors to characterize the
homogeneous subsets, we do not appeal to any induction rule for classifying a
new instance. So, in such a context we do not challenge our first objective
of having an independent measure, even if the random variable now tends
towards an accuracy estimate. Actually, with $k = 1$, computing the quantity
$1 - \frac{D}{|E|}$ (with $|E| = kn$) is strictly equivalent to computing an
accuracy estimate using a Leave-One-Out Cross-Validation procedure (LOOCV)
with a 1-NN classifier. On the other hand, this is not the case when
$k > 1$. In particular, when $k$ is even, a tie-breaking rule chooses a
nearest neighbor at random in case of ties. This uncertainty does not exist
with our strategy because the random variable is built only from the number
of deleted edges, without any induction rule. According to this remark, we
can expect not only to compute good accuracy estimates (very close to the
LOOCV) but also to ensure a smaller standard deviation on small learning
samples.
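The equivalence stated for $k = 1$ can be checked numerically. The sketch below is an illustration only: it assumes numeric tuples compared with the Euclidean distance, a brute-force $O(n^2 \log n)$ neighbor search, and hypothetical function names; the toy sample is made up.

```python
import math


def knn_graph_edges(points, k):
    """Directed edges (i, j) from each point i to its k nearest other points."""
    edges = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        edges.extend((i, j) for j in others[:k])
    return edges


def kept_edge_proportion(points, labels, k):
    """1 - D/|E|, where |E| = k * n and D counts edges joining different classes."""
    edges = knn_graph_edges(points, k)
    deleted = sum(1 for i, j in edges if labels[i] != labels[j])
    return 1 - deleted / len(edges)


def loocv_1nn_accuracy(points, labels):
    """Leave-one-out accuracy of a 1-NN classifier."""
    correct = 0
    for i, p in enumerate(points):
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(points)


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (2.4,), (4.0,), (6.0,), (7.0,), (9.5,)]
    lab = ["a", "a", "a", "b", "b", "b", "a"]
    print(kept_edge_proportion(pts, lab, k=1))
    print(loocv_1nn_accuracy(pts, lab))
```

With $k = 1$ each instance contributes exactly one edge, to its nearest neighbor, and that edge is kept precisely when the 1-NN rule would classify the left-out instance correctly, so the two printed values coincide.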

Before providing a theoretical proof of this property in future work, we show
here some first experimental results. We simulated samples of two classes
according to distributions known a priori (for class 1, $\forall i$,
$X_i \sim N(\mu_1 = 1, \sigma)$; for class 2, $\forall i$,
$X_i \sim N(\mu_2, \sigma)$, $\mu_1 < \mu_2$). We varied the number of
instances, and we also modified the distributions in order to increase the
degree of overlap ($\mu_2 \leftarrow \mu_2 - \epsilon$). For different values
of $k$ we computed the distributions of the quantity $1 - \frac{D}{|E|}$ and
of the accuracy estimate $Acc$ obtained by LOOCV with a k-NN classifier.
While in the majority of cases the means of $1 - \frac{D}{|E|}$ and $Acc$ are
very close, the difference between the standard deviations,
$d = \sigma_{Acc} - \sigma_{1 - D/|E|}$, seems to be significant. Results for
$k = 2$ are presented in Table 2. We can note that the standard deviation of
$1 - \frac{D}{|E|}$ is never larger than the one computed by LOOCV
($d \geq 0$). For small samples with a high degree of overlap
($\mu_2 \approx \mu_1 \approx 1$), the difference is very large because of
the tie-breaking rule. Progressively, with bigger learning sets, this
difference tends to zero.

7 Conclusion 

In this paper, we have developed a statistical test to compute a priori a
measure of learnability for a given problem, represented in any feature
space. This test is based on a Bayesian approach: the underlying hypothesis
is that the distribution of classes in the training set is a good mirror of
the difficulty of building a learning machine. So, we compute a probability
measuring, in some sense, the overlap of the target classes. This
probability is considered a good measure of learnability for a given
representation space of the problem. We implemented this test and
experimented with it on 11 databases: the practical results are in
accordance with the expected ones. The high complexity of our algorithm
directs our research towards new geometrical graphs to characterize the
homogeneous subsets. A preliminary experimental study shows interesting
results with the k-NN graph, which deserve further investigation in future
work.
Abstract. In this paper we introduce Accurate Linguistic Modelling (ALM),
an approach to design linguistic models from data, which are accurate 
to a high degree and may be suitably interpreted. Linguistic models 
constitute an Intelligent Data Analysis structure that has the advantage 
of providing a human-readable description of the system modelled in the 
form of linguistic rules. Unfortunately, their accuracy is sometimes not 
as high as desired, thus causing the designer to discard them and replace 
them by other kinds of more accurate but less interpretable models. 
ALM has the aim of solving this problem by improving the accuracy of 
linguistic models while maintaining their descriptive power, taking as a 
base some modifications on the interpolative reasoning developed by the 
Fuzzy Rule-Based System composing the model. In this contribution we 
shall introduce the main aspects of ALM, along with a specific design 
process based on it. The behaviour of this learning process in the solving 
of two different applications will be shown. 



1 Introduction 

Nowadays, one of the most important areas of application of Fuzzy Set Theory,
as developed by Zadeh in 1965, is Fuzzy Rule-Based Systems (FRBSs). These
systems constitute an extension of classical Rule-Based Systems, because they
deal with fuzzy rules instead of classical logic rules.

In this approach, fuzzy IF-THEN rules are formulated and a process of fuzzi- 
fication, inference and defuzzification leads to the final decision of the system. 
Although sometimes the fuzzy rules can be directly derived from expert knowl- 
edge, different efforts have been made to obtain an improvement on system 
performance by incorporating learning mechanisms guided by numerical infor- 
mation to define the fuzzy rules and/or the membership functions associated to 
them. Hence, FRBSs are a suitable tool for Intelligent Data Analysis where the 
structure considered to represent the available data is a Fuzzy Rule Base. 

From this point of view, the most important application of FRBSs is system
modelling, which in this field may be considered as an approach used to model
a system making use of a descriptive language based on Fuzzy Logic with fuzzy
predicates. In this kind of modelling we usually find two conflicting
requirements, accuracy and interpretability.

* This research has been supported by CICYT TIC96-0778

When the main requirement is interpretability, descriptive Mamdani-type FRBSs
are considered, which use fuzzy rules composed of linguistic variables that
take values in a term set with a real-world meaning. This area is called Fuzzy
Linguistic Modelling because the linguistic model consists of a set of
linguistic descriptions regarding the behaviour of the system being modelled.
Nevertheless, the accuracy of these kinds of models is sometimes not sufficient
to solve the problem adequately. To address this, in this paper we introduce
Accurate Linguistic Modelling (ALM), a Linguistic Modelling approach that
allows us to improve the accuracy of linguistic models without losing their
interpretability to a high degree.

To do so, this contribution is set up as follows. In Section 2, a brief
introduction to FRBSs is presented, with a strong focus on descriptive
Mamdani-type ones. Section 3 is devoted to introducing the basis of ALM. In
Section 4, a Linguistic Modelling process based on it is proposed. In Section 5,
the behaviour of the linguistic models generated to solve two different
applications is analysed. Finally, in Section 6, some concluding remarks are
pointed out.

2 Fuzzy Rule-Based Systems 

An FRBS presents two main components: 1) the Inference Engine, which puts
into effect the fuzzy inference process needed to obtain an output from the FRBS
when an input is specified, and 2) the Fuzzy Rule Base, representing the
available knowledge about the problem being solved in the form of fuzzy
IF-THEN rules.

The structure of the fuzzy rules in the Fuzzy Rule Base determines the type 
of FRBS. Two main types of fuzzy rules are usually found in the literature: 

1. Descriptive Mamdani-type fuzzy rules, also called linguistic rules, which
present the expression:

IF $X_1$ is $A_1$ and ... and $X_n$ is $A_n$ THEN $Y$ is $B$

with $X_1, \ldots, X_n$ and $Y$ being the input and output linguistic
variables, respectively, and $A_1, \ldots, A_n$ and $B$ being linguistic
labels, each one of them having an associated fuzzy set defining its meaning.

2. Takagi-Sugeno-Kang (TSK) fuzzy rules, which are based on representing
the consequent as a polynomial function of the inputs:

IF $X_1$ is $A_1$ and ... and $X_n$ is $A_n$ THEN $Y = p_1 \cdot X_1 + \ldots + p_n \cdot X_n + p_0$

with $p_0, p_1, \ldots, p_n$ being real-valued weights.

The structure of a descriptive Mamdani-type FRBS is shown in Figure 1. As can
be seen, and due to the use of linguistic variables, the Fuzzy Rule Base
becomes a Knowledge Base (KB) composed of the Rule Base (RB), constituted
by the collection of linguistic rules joined by means of the connective also,
and of the Data Base (DB), containing the term sets and the membership
functions defining their semantics.

Fig. 1. Generic structure of a descriptive Mamdani-type Fuzzy Rule-Based System

On the other hand, the Inference Engine comprises three components:
a Fuzzification Interface, which transforms crisp input data into fuzzy sets,
an Inference System, which uses these together with the KB to perform the
fuzzy inference process, and a Defuzzification Interface, which obtains the
final crisp output from the individual fuzzy outputs inferred.

The Inference System is based on the application of the Generalized Modus
Ponens, an extension of the classical logic Modus Ponens. It is done by means of
the Compositional Rule of Inference, which in its simplest form is reduced to:

$\mu_{B'}(y) = I(\mu_A(x_0), \mu_B(y))$

with $x_0 = (x_1, \ldots, x_n)$ being the current system input,
$\mu_A(x_0) = T(A_1(x_1), \ldots, A_n(x_n))$ being the matching degree between
the rule antecedent and the input ($T$ is a conjunctive operator, a t-norm),
and $I$ being a fuzzy implication operator.
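
As a minimal sketch of the inference step just described, the following Python
fragment computes the matching degree of one rule and the fuzzy set inferred
from it. The helper names (triangular, matching_degree, inferred_set) and the
three example labels are illustrative assumptions, not part of the paper, and
the minimum is used for both T and I.

```python
def triangular(a, b, c):
    """Membership function of a triangular fuzzy set with support (a, c) and peak b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu


def matching_degree(antecedents, x0, t_norm=min):
    """T(A_1(x_1), ..., A_n(x_n)) for a single rule; t_norm must accept an iterable."""
    return t_norm(label(x) for label, x in zip(antecedents, x0))


def inferred_set(h, consequent, implication=min):
    """Membership function of B': y -> I(h, mu_B(y))."""
    return lambda y: implication(h, consequent(y))


# Hypothetical labels on [0, 1]; they are not taken from the paper.
small = triangular(-0.5, 0.0, 0.5)
medium = triangular(0.0, 0.5, 1.0)
large = triangular(0.5, 1.0, 1.5)

# Rule: IF X1 is small and X2 is medium THEN Y is large
h = matching_degree([small, medium], (0.2, 0.4))
b_prime = inferred_set(h, large)
```

With the minimum as implication, the inferred set is simply the consequent
label clipped at the height h.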

The Compositional Rule of Inference is applied to each individual rule, thus
obtaining an output fuzzy set from each rule in the KB. The Defuzzification
Interface aggregates the information provided by these fuzzy sets and transforms
it into a single crisp value by working in one of the two following ways:

1. Mode A: Aggregation first, defuzzification after: The individual fuzzy sets
inferred are aggregated to obtain a final fuzzy set $B'$ by means of a fuzzy
aggregation operator $G$, which models the also operator that relates the
rules in the base. Then, a defuzzification method $D$ is applied to transform
the latter into a crisp value $y_0$ that will be given as system global output:

$\mu_{B'}(y) = G\{\mu_{B'_1}(y), \mu_{B'_2}(y), \ldots, \mu_{B'_T}(y)\}$ ; $y_0 = D(\mu_{B'}(y))$

with $B'_i$ being the fuzzy set inferred from the ith of the $T$ rules in the
KB. Usual choices for $G$ and $D$ are, respectively, the minimum and maximum
operators and the Centre of Gravity and Mean of Maxima defuzzification
methods.

2. Mode B: Defuzzification first, aggregation after: In this case, the contribution
of each fuzzy set inferred is individually considered and the final crisp value
is obtained by means of an operation (an average, a weighted average, or the
selection of one of them, among others) performed on a crisp characteristic
value of each one of the individual fuzzy sets.

The most commonly used characteristic values are the Centre of Gravity
and the Maximum Value Point. Several importance degrees are considered
to select or weight them, among others the matching degree of the rule and the
area or the height of the consequent fuzzy set.
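
The difference between the two modes can be illustrated with a small numerical
sketch. It assumes the maximum as aggregation operator G, a Centre of Gravity
computed on a regular grid, and a weighted average for mode B; the function
names and the two inferred sets are hypothetical.

```python
def clipped_triangle(a, b, c, height):
    """Triangular fuzzy set (support (a, c), peak b) clipped at the given height."""
    def mu(y):
        if y <= a or y >= c:
            return 0.0
        slope = (y - a) / (b - a) if y <= b else (c - y) / (c - b)
        return min(height, slope)
    return mu


def centre_of_gravity(mu, lo, hi, steps=2000):
    """Numerical centre of gravity of mu, sampled on a regular grid over [lo, hi]."""
    ys = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    ws = [mu(y) for y in ys]
    return sum(y * w for y, w in zip(ys, ws)) / sum(ws)


def mode_a(inferred, lo, hi):
    """Aggregate the inferred sets first (here with the maximum), then defuzzify."""
    return centre_of_gravity(lambda y: max(mu(y) for mu in inferred), lo, hi)


def mode_b(inferred, weights, lo, hi):
    """Defuzzify each inferred set first, then take a weighted average."""
    centres = [centre_of_gravity(mu, lo, hi) for mu in inferred]
    return sum(w * c for w, c in zip(weights, centres)) / sum(weights)


# Two hypothetical inferred sets, produced by rules with matching degrees 0.8 and 0.2.
inferred = [clipped_triangle(0.0, 0.25, 0.5, 0.8), clipped_triangle(0.5, 0.75, 1.0, 0.2)]
y_a = mode_a(inferred, 0.0, 1.0)
y_b = mode_b(inferred, [0.8, 0.2], 0.0, 1.0)
```

In this example both outputs lie between the two peaks, but they differ: mode A
depends on the area of the aggregated set, whereas mode B depends only on the
individual centres and their weights.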

3 ALM: An Approach for Generating Accurate Linguistic
Models for Intelligent Data Analysis

One of the most interesting features of an FRBS is the interpolative reasoning 
it develops, which plays a key role in its high performance and is a consequence 
of the cooperation among the fuzzy rules composing the KB. As mentioned in 
the previous Section, the output obtained from an FRBS is not usually due to 
a single fuzzy rule but to the cooperative action of several fuzzy rules that have 
been fired, because they match the input to the system to some degree. 

ALM deals with the way in which the linguistic model makes inference in
order to improve its accuracy while not losing its descriptive power. Hence, it
is based on two main aspects that are described in the two following
subsections. The remaining subsection of this Section makes some remarks on the
proposed approach.

3.1 A New Descriptive Knowledge Base Structure for Locally 
Improving the Model Accuracy 

Some problems derived from the inflexibility of the concept of linguistic
variable make the usual linguistic model structure shown in the previous Section
present low accuracy when working with very complex systems. For this
reason, we consider a new, more flexible KB structure that allows us
to improve the accuracy of linguistic models without losing their interpretability.

In a previous proposal, an attempt was made to put this idea into effect, first
by designing a fuzzy model based on simplified TSK-type rules, i.e., rules with
a single point in the consequent, and then transforming it into a linguistic
model that has to be as accurate as the former. To do so, its authors introduced
a secondary KB, in addition to the usual KB, and proposed an Inference Engine
capable of obtaining an output result from the combined action of both Fuzzy
Rule Bases. Hence, what the system really does is to allow a specific
combination of antecedents to have two different consequents associated, the
first and second in importance, thus avoiding some of the said problems
associated with the linguistic rule structure.

Taking this idea as a starting point, we allow a specific combination of
antecedents to have two consequents associated, the first and second in
importance in the fuzzy input subspace, but only in those cases in which it is
really necessary to improve the model accuracy in this subspace, and not in all
the possible ones as in that proposal. Therefore, the existence of a primary
and a secondary Fuzzy Rule Base is avoided, and the number of rules in the
single KB is decreased, which makes the model easier to interpret.

These double-consequent rules locally improve the interpolative reasoning
performed by the model, allowing a shift of the main labels that makes the final
output of the rule lie in an intermediate zone between the two consequent fuzzy
sets. They do not constitute an inconsistency from a Linguistic Modelling point
of view because they have the following interpretation:



IF $x_1$ is $A_1$ and ... and $x_n$ is $A_n$ THEN $y$ is between $B_1$ and $B_2$



Other advantages of our approach are that we do not need the existence of
a previous TSK fuzzy model and that we work with a classical fuzzy Inference
Engine. In this contribution, we shall use the Minimum t-norm in the role of
conjunctive and implication operator (although any other fuzzy operator may
be considered for either of the two tasks). The only restriction is to use a
defuzzification method working in mode B and considering the matching degree
of the rules fired. We shall work with the Centre of Gravity weighted by the
matching degree, whose expression is as follows:

$y_0 = \frac{\sum_{i=1}^{T} h_i \cdot y_i}{\sum_{i=1}^{T} h_i}$

with $T$ being the number of rules in the KB, $h_i$ being the matching degree
between the ith rule and the current system input (see Section 2) and $y_i$
being the centre of gravity of the fuzzy set inferred from that rule.
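
A short sketch may help to see why a double-consequent rule yields an output
between its two consequents under this defuzzification. It assumes the rule is
treated as two single rules sharing the same matching degree; the function name
and the numerical values are hypothetical.

```python
def weighted_centre_of_gravity(fired):
    """fired: iterable of (h_i, y_i) pairs, one per rule in the KB."""
    fired = list(fired)
    total = sum(h for h, _ in fired)
    if total == 0:
        return None  # no rule matches the input; how to handle this is left open here
    return sum(h * y for h, y in fired) / total


# A double-consequent rule treated as two single rules with the same matching degree.
# The centres 0.5 and 0.75 are hypothetical values for the consequents B_1 and B_2.
y0 = weighted_centre_of_gravity([(0.6, 0.5), (0.6, 0.75)])
```

Since both rules fire with the same degree, the output falls midway between the
two centres; neighbouring rules with other matching degrees would shift it.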



3.2 A New Way to Generate Fuzzy Rules for Globally Improving 
the Cooperation Between Them 

The previous point deals with the local improvement of the fuzzy reasoning
accuracy in a specific fuzzy input subspace. The second aspect, in contrast,
deals with the cooperation between the rules in the KB, i.e., with the
overlapped space zones that are covered by different linguistic rules. As is
known, generating the best fuzzy rule in each input subspace does not ensure
that the FRBS will perform well, because the rules composing the KB may not
cooperate suitably. Often, the accuracy of the FRBS may be improved if rules
other than the primary ones are generated in some subspaces, because they
cooperate better with their neighbour rules.

Hence, we shall consider an operation mode based on generating a preliminary
fuzzy rule set composed of a large number of rules, which will be single or
double-consequent ones depending on the complexity of the specific fuzzy input
subspace (no rules will be generated in the subspaces where the system is not
defined). Then, all these fuzzy rules will be treated as single-consequent ones
(each double-consequent rule will be decomposed into two simple rules) and the
subset of them with the best cooperation level will be selected in order to
compose the final KB.

3.3 Some Important Remarks about ALM

We may draw two very important conclusions from the assumptions made in
the previous subsections. On the one hand, it is possible that, although the
preliminary fuzzy rule set generated has some double-consequent rules, the final
KB does not contain any rule of this kind after the selection process. In this
case, the linguistic model obtained has taken advantage of the way in which the
fuzzy rules have been generated, because many rule subsets with different
cooperation levels have been analysed. This is why it will present a KB
composed of rules cooperating well, which may not happen in other inductive
design methods, such as Wang and Mendel’s (WM-method) and the Explorative
Generation Method (EGM), an adaptation of Ishibuchi et al.’s fuzzy
classification rule generation process able to deal with rules with linguistic
consequents, both of which are based on directly generating the best consequent
for each fuzzy input subspace.

On the other hand, it is possible that the KB obtained presents fewer rules
than KBs generated by other methods thanks to both aspects: the existence of
two rules in the same input subspace and the generation of neighbour rules with
better cooperation may mean that many of the rules in the KB are unnecessary
to give the final system response. These assumptions will be corroborated in
view of the experiments developed in Section 5.

4 A Linguistic Modelling Process Based on ALM 

Following the assumptions presented in the previous Section, any design process
based on ALM will present two stages: a preliminary linguistic rule generation
method and a rule selection method. The composition of both stages in the
learning process presented in this contribution, which takes the WM-method as a
base, is shown in the next two subsections. Another ALM process, based on the
EGM, has also been proposed.

4.1 The Linguistic Rule Generation Method 

Let $E$ be an input-output data set representing the behaviour of the system
being modelled. Then the RB is generated by means of the following steps:

1. Consider a fuzzy partition of the input variable spaces: It may be obtained
from the expert information (if it is available) or by a normalization
process. In this paper, we shall work with symmetrical fuzzy partitions of
triangular membership functions (see Figure 2).

2. Generate a preliminary linguistic rule set: This set will be formed by the rule
best covering each example (input-output data pair) contained in $E$. The
structure of the rule $R_l$ = IF $x_1$ is $A_1$ and ... and $x_n$ is $A_n$
THEN $y$ is $B$ generated from the example $e_l = (x_1^l, \ldots, x_n^l, y^l)$
is obtained by setting each rule variable to the linguistic label associated
with the fuzzy set best covering every example component.

Fig. 2. Graphical representation of the type of fuzzy partition considered

3. Give an importance degree to each rule: The importance degree associated
with $R_l$ will be obtained as follows:

$G(R_l) = \mu_{A_1}(x_1^l) \cdot \ldots \cdot \mu_{A_n}(x_n^l) \cdot \mu_B(y^l)$

4. Obtain a final RB from the preliminary linguistic rule set: This step is the 
only one differing from the original WM-method. Whilst in that method 
the rule with the highest importance degree is the only one chosen for each 
combination of antecedents, in our case we allow the two most important 
rules in each input subspace — if they exist — to form part of the RB. 

Of course, a combination of antecedents may have no rules associated (if 
there are no examples in that input subspace) or only one rule (if all the ex- 
amples in that subspace generated the same rule). Therefore, the generation 
of double-consequent rules is only addressed when the problem complexity, 
represented by the example set, shows that it is necessary. 
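
The four steps above can be sketched as follows. This is only an illustration
under stated assumptions: all variables share one symmetrical triangular
partition of [0, 1], the importance degree is taken as the product of the
membership degrees of the example components, and the names make_partition,
best_label and generate_rules are hypothetical.

```python
from collections import defaultdict


def make_partition(n_labels):
    """Symmetrical triangular partition of [0, 1] as a list of (a, b, c) triples."""
    step = 1.0 / (n_labels - 1)
    return [((k - 1) * step, k * step, (k + 1) * step) for k in range(n_labels)]


def membership(tri, x):
    a, b, c = tri
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def best_label(partition, x):
    """Index and membership degree of the fuzzy set best covering x."""
    degrees = [membership(tri, x) for tri in partition]
    k = max(range(len(partition)), key=degrees.__getitem__)
    return k, degrees[k]


def generate_rules(examples, partition, max_consequents=2):
    """examples: iterable of (x_1, ..., x_n, y) tuples with all values in [0, 1]."""
    candidates = defaultdict(dict)  # antecedent labels -> {consequent label: degree}
    for *xs, y in examples:
        labels, degree = [], 1.0
        for x in xs:
            k, d = best_label(partition, x)
            labels.append(k)
            degree *= d
        k_out, d_out = best_label(partition, y)
        degree *= d_out
        key = tuple(labels)
        # Keep the highest importance degree seen for this (antecedent, consequent) pair.
        if degree > candidates[key].get(k_out, 0.0):
            candidates[key][k_out] = degree
    rule_base = []
    for antecedent, consequents in candidates.items():
        ranked = sorted(consequents.items(), key=lambda kv: kv[1], reverse=True)
        for k_out, degree in ranked[:max_consequents]:
            rule_base.append((antecedent, k_out, degree))
    return rule_base


# Three hypothetical examples with two inputs and one output.
examples = [(0.1, 0.1, 0.1), (0.1, 0.2, 0.45), (0.9, 0.9, 1.0)]
rules = generate_rules(examples, make_partition(3))
```

Calling generate_rules with max_consequents=1 keeps only the most important
rule per antecedent combination, which corresponds to the original WM-method
behaviour described in step 4.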

4.2 The Rule Selection Genetic Process 

In order to obtain a final KB composed of rules cooperating well, and to ensure
that more than a single rule is used only in those zones where it is really
necessary, we shall use a rule selection process with the aim of selecting the
best subset of rules from the initial linguistic rule set.

The selection of the best-cooperating subset of linguistic rules is a
combinatorial optimization problem. Since the number of variables involved in
it, i.e., the number of preliminary rules, may be very large, we consider an
approximate algorithm to solve it, a Genetic Algorithm (GA). However, we should
note that any other kind of technique can be considered without any change in
ALM. Our rule selection genetic process is based on a binary-coded GA, in which
the selection of the individuals is performed using the stochastic universal
sampling procedure together with an elitist selection scheme, and the
generation of the offspring population is put into effect by using the classical
binary two-point crossover and uniform mutation operators.

The coding scheme generates fixed-length chromosomes. Considering the
rules contained in the linguistic rule set derived from the previous step counted
from 1 to $T$, a $T$-bit string $C = (c_1, \ldots, c_T)$ represents a subset of
candidate rules to form the RB finally obtained as this stage output, $R^B$,
such that:

If $c_i = 1$ then $R_i \in R^B$ else $R_i \notin R^B$

The initial population is generated by introducing a chromosome representing
the complete previously obtained rule set, i.e., with all $c_i = 1$. The remaining
chromosomes are selected at random.
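
The coding scheme, the initial population and the two genetic operators can be
sketched as below. The function names are hypothetical, and the per-gene
mutation probability p_gene is an assumption of this sketch (the experiments
later quote a mutation probability per individual); stochastic universal
sampling and elitism are not shown.

```python
import random


def decode(chromosome, rule_set):
    """Rules R_i whose gene c_i equals 1 form the candidate rule base."""
    return [rule for bit, rule in zip(chromosome, rule_set) if bit == 1]


def initial_population(n_rules, pop_size, rng):
    """One chromosome with all genes set to 1; the remaining ones drawn at random."""
    population = [[1] * n_rules]
    while len(population) < pop_size:
        population.append([rng.randint(0, 1) for _ in range(n_rules)])
    return population


def two_point_crossover(parent_a, parent_b, rng):
    """Swap the segment between two distinct cut points."""
    i, j = sorted(rng.sample(range(len(parent_a) + 1), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b


def uniform_mutation(chromosome, rng, p_gene):
    """Flip each gene independently with probability p_gene (assumed parameter)."""
    return [1 - bit if rng.random() < p_gene else bit for bit in chromosome]


rng = random.Random(0)
population = initial_population(n_rules=5, pop_size=4, rng=rng)
child_a, child_b = two_point_crossover([1] * 5, [0] * 5, rng)
```

Because every gene of a child comes from one of the two parents at the same
position, the chromosome length stays fixed, as the coding scheme requires.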

As regards the fitness function, $F(C_j)$, it is based on a global error measure
that determines the accuracy of the FRBS encoded in the chromosome, which
depends on the cooperation level of the rules existing in the KB. We usually
work with the mean square error (SE), although other measures may be used.
SE over the training data set, $E$, is represented by the following expression:

$F(C_j) = \frac{1}{2|E|} \sum_{e_l \in E} (y^l - S(x^l))^2$

where $S(x^l)$ is the output value obtained from the FRBS using the RB coded
in $C_j$ when the input variable values are $x^l = (x_1^l, \ldots, x_n^l)$, and
$y^l$ is the known desired value.
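
A possible shape for the fitness evaluation is sketched below. Two points are
assumptions of this sketch rather than statements of the paper: the
normalisation of the error by 2|E|, and the infer callable, a hypothetical
stand-in for the FRBS inference that maps a list of selected rules and an input
to a crisp output.

```python
def fitness(chromosome, rule_set, examples, infer):
    """Squared error of the FRBS built from the rules selected by the chromosome.

    examples: list of (x, y) pairs; infer(selected_rules, x) returns a crisp output.
    The factor 1 / (2 * |E|) is an assumed normalisation.
    """
    selected = [rule for bit, rule in zip(chromosome, rule_set) if bit]
    total = sum((y - infer(selected, x)) ** 2 for x, y in examples)
    return total / (2 * len(examples))


# Toy stand-in for the FRBS: each "rule" is a constant and the output is their mean.
def toy_infer(rules, x):
    return sum(rules) / len(rules) if rules else 0.0


toy_rules = [0.0, 1.0]
toy_examples = [((0.0,), 1.0), ((1.0,), 1.0)]
```

In this toy setting, dropping the first rule lowers the error to zero, which
mirrors the idea that removing badly cooperating rules can improve accuracy.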

5 Examples of Application 

With the aim of analysing the behaviour of the proposed ALM process, we have
chosen two different applications: the fuzzy modelling of a three-dimensional
function and the problem of rice taste evaluation. In both cases, we shall
compare the accuracy of the linguistic models generated from our process with
the ones designed by means of other methods with different characteristics: two
methods based on generating the RB rule by rule, i.e., without considering the
cooperation among linguistic rules (the one proposed by Nozaki et al., the
N-method, which has been mentioned in Section 3, and the simple WM-method), and
another process based on working at the level of the whole KB (NEFPROX, a
Neuro-Fuzzy approach).

5.1 Fuzzy Modelling of a Three-Dimensional Function

The expression of the selected function, the universes of discourse considered for
the variables and its graphical representation are shown as follows. It is a simple
unimodal function presenting two discontinuities at the points (0,0) and (1,1).

$F(x_1, x_2) = 10 \cdot \frac{x_1 - x_1 x_2}{x_1 - 2 x_1 x_2 + x_2}$,

$x_1, x_2 \in [0, 1]$, $F(x_1, x_2) \in [0, 10]$
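
For illustration, the function can be coded as below, assuming it has the form
10 · (x1 − x1·x2) / (x1 − 2·x1·x2 + x2); the numerator is an assumption of this
sketch. The grid_samples helper is also hypothetical: a regular 26 × 26 grid
without the two undefined corner points is merely one layout that yields 674
samples, and the paper does not state how its training set was laid out.

```python
def F(x1, x2):
    """Assumed form of the benchmark: 10 * (x1 - x1*x2) / (x1 - 2*x1*x2 + x2)."""
    denominator = x1 - 2 * x1 * x2 + x2
    if denominator == 0:
        raise ValueError("F is undefined at (0, 0) and (1, 1)")
    return 10 * (x1 - x1 * x2) / denominator


def grid_samples(points_per_axis=26):
    """Regular grid on [0, 1]^2 without the two corner points where F is undefined."""
    last = points_per_axis - 1
    samples = []
    for i in range(points_per_axis):
        for j in range(points_per_axis):
            if (i, j) in ((0, 0), (last, last)):
                continue
            x1, x2 = i / last, j / last
            samples.append((x1, x2, F(x1, x2)))
    return samples


training = grid_samples()
```

Under the assumed form, the denominator vanishes only at (0,0) and (1,1), which
agrees with the two discontinuities mentioned in the text.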



In order to model this function, a training data set composed of 674 data
uniformly distributed in the three-dimensional definition space has been obtained
experimentally. In addition, another set composed of 67 data (ten percent of
the training set size) has been randomly generated for use as a test set for
evaluating the performance of the design methods. Of course, the latter set is
only employed to measure the generalization ability of the generated model,
i.e., it is not considered in the learning stage. The DB used for all design
methods is constituted by three normalised fuzzy partitions formed by seven
triangular-shaped fuzzy sets (as shown in Fig. 2). The linguistic term set
considered is {ES, VS, S, M, L, VL, EL}, with E standing for Extremely, V for
Very, and S, M, and L for Small, Medium and Large, respectively. Finally, the
parameters considered for the rule selection genetic process are: Number of
generations: 500, Population size: 61, Crossover probability: 0.6 and Mutation
probability: 0.1 (per individual).

Fig. 3. Graphical representation of the function considered

The results obtained in the experiments developed are collected in Table 1,
where #R stands for the number of simple rules of the corresponding KB,
and SEtra and SEtst for the values obtained in the SE measure computed over
the training and test data sets, respectively. As may be observed, the results
obtained by our process after each stage, generation and selection, are included.



Table 1. Results obtained in the fuzzy modelling of the selected function

             Generation                   Selection
Method       #R   SEtra     SEtst         #R   SEtra     SEtst
N-method     98   0.175382  0.061249      -    -         -
WM-method    49   0.194386  0.044466      -    -         -
NEFPROX      49   0.505725  0.272405      -    -         -
ALM          88   0.220062  0.146529      55   0.019083  0.026261



In view of these results, we should underline the good behaviour presented by
our ALM process, which generates the most accurate model in the approximation
of the training and test sets. As regards the number of rules in the KBs, we
should note that our linguistic model only presents a few more rules than the
ones generated from the WM-method and from NEFPROX. As shown in Table 2,
by only adding eight new rules to the KB generated by means of the WM-method
(and by removing two others), a significantly more accurate model is obtained
with a very small loss of interpretability (as mentioned, this KB only contains
eight double-consequent rules). On the other hand, our model is considerably
more accurate than the N-method one, while presenting a much simpler KB
(55 rules against 98).



Table 2. Decision tables for the linguistic models obtained for the selected 
function by means of the WM-method (left) and our ALM process (right) 

5.2 Rice Taste Evaluation

Subjective qualification of food taste is a very important but difficult problem.
In the case of rice taste qualification, it is usually put into effect by means of
a subjective evaluation called the sensory test. In this test, a group of experts,
usually composed of 24 persons, evaluate the rice according to a set of
characteristics associated with it. These factors are: flavor, appearance,
taste, stickiness, and toughness.

Because of the large number of relevant variables, the problem of rice taste
analysis becomes very complex, which leads to solving it by means of modelling
techniques capable of obtaining a model representing the non-linear relationships
existing in it. Moreover, the problem-solving goal is not only to obtain an
accurate model, but to obtain a user-interpretable model as well, capable of
shedding some light on the reasoning process performed by the expert when
evaluating a kind of rice in a specific way. For all these reasons, in this
Section we deal with obtaining a linguistic model to solve the said problem.

In order to do so, we use a previously published data set. This set is
composed of 105 data arrays collecting subjective evaluations of the six variables
in question (the five mentioned and the overall evaluation of the kind of rice),
made by experts on this number of kinds of rice grown in Japan (for example,
Sasanishiki, Akita-Komachi, etc.). The six variables are normalised, thus taking
values in the real interval [0, 1].

With the aim of not biasing the learning, we have randomly obtained ten
different partitions of the said set, each composed of 75 pieces of data in the
training set (to generate ten linguistic models in each experiment) and 30 in
the test one (to evaluate the performance of the generated models). To solve the
problem, we use the same Linguistic Modelling processes considered in the
previous Section. The values of the parameters of the rule selection genetic
process are the same ones considered in that Section as well.
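
The partitioning scheme can be sketched as follows. The function name, the seed
and the use of independent shuffles are assumptions of this sketch; the paper
only states that ten random 75/30 partitions were drawn.

```python
import random


def random_partitions(data, n_partitions=10, n_train=75, seed=0):
    """Return a list of (training, test) splits obtained by independent shuffles."""
    rng = random.Random(seed)
    partitions = []
    for _ in range(n_partitions):
        shuffled = list(data)
        rng.shuffle(shuffled)
        partitions.append((shuffled[:n_train], shuffled[n_train:]))
    return partitions


# Placeholder records standing in for the 105 data arrays.
records = list(range(105))
partitions = random_partitions(records)
```

Each model is then learnt on the 75 training records of one partition and
evaluated on the remaining 30, and the reported errors are averages over the
ten partitions.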

As in previous work on this problem, we have worked with normalised fuzzy
partitions (see Fig. 2) composed of a different number of linguistic labels for
the six variables considered (two and three, to be precise). The results
obtained in the experiments developed are collected in Table 3. The values shown
in columns SEtra and SEtst have been computed as an average of the SE values
obtained on the training and test data sets, respectively, by the ten linguistic
models generated in each case. The column #L stands for the number of labels
considered in the fuzzy partitions in each experiment and #R stands for the
average number of linguistic rules in the KBs of the models generated from each
process.



Table 3. Results obtained in the rice taste evaluation

                  Generation                  Selection
#L  Method        #R     SEtra    SEtst       #R    SEtra    SEtst
2   N-method      64     0.00862  0.00985     -     -        -
2   WM-method     15     0.01328  0.01311     -     -        -
2   NefProx       15     0.00633  0.00568     -     -        -
2   ALM           19.8   0.02192  0.02412     5     0.00341  0.00398
3   N-method      364.8  0.00251  0.00322     -     -        -
3   WM-method     23     0.00333  0.00375     -     -        -
3   NefProx       32.2   0.00338  0.00644     -     -        -
3   ALM           25.7   0.00595  0.00736     12.2  0.00185  0.00290



From an analysis of these results, we may again note the good behaviour
presented by the proposed ALM process. The linguistic models generated from
it clearly outperform the ones designed by means of the other processes in the
approximation of both data sets (training and test) in the two experiments
developed (using 2 and 3 labels in the fuzzy partitions). On the other hand, even
following the approach of double-consequent generation proposed in Section 3,
our process generates the KBs with fewer rules, thus making the corresponding
models simpler to interpret. In fact, none of the 20 KBs finally generated
presents double-consequent rules, due to the action of the selection process.

6 Concluding Remarks 

In this paper, ALM has been proposed, a new approach to design linguistic
models in the field of Intelligent Data Analysis that are accurate to a high
degree and suitably interpretable by human beings. An ALM process has been
introduced as well, and its behaviour has been compared to other Linguistic
Modelling techniques in solving two different problems. The proposed process
has obtained very good results.

This leads us to conclude that, as mentioned in Section 3, our process has
the capability of distinguishing the unnecessary rules and of generating KBs with
good cooperation. The ALM operation mode, based on a) generating a preliminary
fuzzy rule set with a large number of rules (considering double-consequent
ones if necessary) and b) selecting the subset of them cooperating best,
allows us to obtain good results in the area of Linguistic Modelling.

Abstract. It was previously argued that Decision Tree learning algorithms
such as CART or C4.5 can also be useful to build small and accurate Decision
Lists. In this paper, we investigate the possibility of using a similar
“top-down and prune” scheme to induce formulae from a much different class:
Decision Committees. A decision committee contains rules, each of which is a
couple (monomial, vector), where the vector’s components are highly constrained
with respect to classical polynomials. Each monomial is a condition that, when
matched by an instance, returns its vector. When each monomial has been tested,
the sum of the returned vectors is used to take the decision. Decision Trees,
Lists and Committees are complementary formalisms for the user: while trees are
based on literal ordering, lists are based on monomial ordering, and committees
remove any orderings over the tests. Our contribution is a new algorithm, WIDC,
which learns using the same “top-down and prune” scheme, but building Decision
Committees. Experimental results on twenty-two domains tend to show that WIDC is
able to produce small, accurate, and interpretable decision committees.

1 Introduction 

The ability to choose the output type (decision trees, lists, etc.) of an induction
algorithm is crucial, for at least two reasons. First, the interpretation of the
output depends on the user’s natural preferences, and it may be easier using
specific types of concept representations. Second, the problem addressed by the
algorithm usually favors efficient encoding in some formalisms (e.g. linear
separators) but not in others (e.g. ordinary decision trees). As an example,
[12] quote that

“the clients [business users] found some interesting patterns in the deci- 
sion trees, but they did not feel the structure was natural for them. They 
were looking for those two or three attributes and values (e.g. a combi- 
nation of geographic and industries) where something “interesting” was 
happening. In addition, they felt it was too limiting that the nodes in a 
decision tree represent rules that all start with the same attributes.” 



D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 27-|2BI 1999. 
(c) Springer- Verlag Berlin Heidelberg 1999 

Because classical induction algorithms provide only one, or very few, choices of
output types, the satisfaction of the user’s preferences and/or the search for small
(thus, interpretable) encodings are solved by browsing through the outputs of
various algorithms. However, it is actually hard to compare such outputs, for they
can be built using different schemes, paradigms or principles. The usual error
minimization principle is not always the only goal an induction algorithm has
to realize, and various other parameters can be taken into account: time
complexity, size restrictions over the output, visualization requirements,
by-class errors, etc. As an example, consider CN2’s decision list output and
the decision trees obtained from C4.5. It has been remarked that CN2’s outputs
are on average much bigger than the decision lists equivalent to C4.5’s outputs.
This contradicts a priori observations on the expressive power of decision
lists; in fact, the two algorithms are designed much differently from each
other, and they optimize quite different criteria.

A previous study establishes that the particular induction scheme of C4.5
can be used to build decision lists, which in turn gives quite accurate
comparisons with C4.5’s decision trees, in the light of their theoretical
relationships. In this paper, we propose to extend their general “top-down and
prune” induction scheme to a much different class: decision committees (DC). DC
is the Boolean multiclass extension of polynomial discriminant functions. A
decision committee contains rules, each of these being a couple (monomial,
vector). Each monomial is a condition that, when fired, returns its vector.
After each monomial has been tested, the sum of the returned vectors is used to
take the decision. This additive fashion for combining rules is absent from
classical Boolean classifiers such as Decision Trees (DT) or Decision Lists
(DL). Furthermore, unlike these two latter classes, the classifier contains
absolutely no ordering, neither on variables (unlike DT), nor on monomials
(unlike DL). When sufficiently small DCs are built and adequate restrictions are
taken, a new dimension in interpreting the classifier is obtained, which does
not exist for DT or DL. Namely, any example can satisfy many rules, and a DC can
therefore be interpreted by means of various rule subsets. Decision committees
share a common feature with decision tables: they are all voting methods.
However, decision table classifiers are based on majority votes of the
examples (and not of rules), over a restricted “window” of the description
variables. They require the storing of many examples, and the interpretations
of the data can only be made through this window, according to this potentially
large set of examples. Decision committees rather represent an efficient way to
encode a large voting method into a small number of rules.



The algorithm we present is called WIDC, which stands for “Weak Induction
of Decision Committees”. It has the following key features. First, it uses recent
results on partition boosting and ranking loss boosting [22], and some results
about pruning. Second, it represents a weak learning algorithm as defined by [22],
rather than a boosting algorithm as defined by [22]. In particular, no modification
is made to the examples’ distribution, similarly to C4.5. Additionally, AdaBoost’s
procedure adapted to generate polynomials is not suited to calculating such
constrained vectors as those of DC. Finally, on multiclass problems, when it can be
assumed that each example belongs to one class, WIDC proposes a polynomial
solution for multiclass induction which is faster than that proposed by [22]. In
addition, it avoids an NP-Hardness conjecture of [22].

This paper is organized as follows: in the following section, we present some
definitions about DC. We then proceed with a detailed exposition of each stage
of the algorithm WIDC. In the last section, we present experiments that were
carried out using WIDC on twenty-two problems, most of which can be found
in the UCI repository [2].

2 Definitions 

Let $c$ be the number of classes. An example is a couple $(o, c_o)$ where $o$ is an
observation described over $n$ variables, and $c_o$ its corresponding class among
$\{1, 2, \ldots, c\}$; to each example $(o, c_o)$ is associated a weight $w((o, c_o))$,
representing its appearance probability with respect to a learning sample LS which
we have at our disposal. LS is itself a subset of a whole domain which we denote by
$\mathcal{X}$. A decision committee contains two parts:

- A set of unordered couples (or rules) $\{(t_i, v_i)\}$ where each $t_i$ is a monomial
  (a conjunction of literals) over $\{0, 1, *\}^n$ ($n$ being the number of description
  variables), and each $v_i$ is a vector in $\mathbb{R}^c$ (in the two-classes case, we
  add a single number rather than a 2-component vector).

- A Default Vector $D$ in $[0, 1]^c$. Again, in the two-classes case, it is sufficient
  to replace $D$ by a default class in $\{+, -\}$.

For any observation $o$ and any monomial $t_i$, the proposition “$o$ satisfies $t_i$” is
denoted by $o \Rightarrow t_i$. The opposite proposition “$o$ does not satisfy $t_i$” is
denoted by $o \not\Rightarrow t_i$. The classification of any observation $o$ is made in
the following way: define $V_o$ as follows

$$V_o = \sum_{(t_i, v_i) \,:\, o \Rightarrow t_i} v_i .$$

The class assigned to $o$ is then $\arg\max_{1 \le j \le c} V_o$ if
$|\arg\max_{1 \le j \le c} V_o| = 1$, and $\arg\max_{j \in \arg\max_{1 \le j \le c} V_o} D$
otherwise. In other words, if the maximal component of $V_o$ is unique, then its
index gives the class assigned to $o$. Otherwise, we take the index of the maximal
component of $D$ among the indices of the maximal components of $V_o$.

Figure 1 presents the example of a simple decision committee having three
rules, with $n = 4$. Let $x_1 x_2 x_3 x_4$ be some observation. Since it satisfies the
first and the third monomial, its corresponding vector has respective components 2,
-1 and 0. The greatest value is obtained for class $c_1$, which gives the class of
the observation. Consider now another observation whose corresponding vector
has respective components 0, -1, 0, which makes the default vector assign the
class. The largest value of the default vector among 0.2 and 0.5 is that of class
$c_3$, which gives the class of the observation.
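
The classification rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the representation (a monomial as a dictionary from 0-based variable index to required Boolean value), the function names and the committee itself are hypothetical. In particular, the committee is not the one of Fig. 1; it was merely chosen so that the two vote vectors quoted in the text, (2, -1, 0) and (0, -1, 0), are reproduced, and the middle component of the default vector is invented.

```python
def satisfies(observation, monomial):
    """True when every literal of the monomial agrees with the observation."""
    return all(observation[var] == value for var, value in monomial.items())


def classify(rules, default, observation):
    """Return the 0-based index of the class assigned by the committee."""
    num_classes = len(default)
    total = [0] * num_classes
    for monomial, vector in rules:
        if satisfies(observation, monomial):
            total = [t + v for t, v in zip(total, vector)]
    best = max(total)
    tied = [j for j in range(num_classes) if total[j] == best]
    if len(tied) == 1:
        return tied[0]
    # Ambiguous vote: use the default vector, restricted to the tied classes.
    return max(tied, key=lambda j: default[j])


# Hypothetical committee: n = 4 Boolean variables, c = 3 classes.
rules = [
    ({0: 1, 1: 1}, (1, -1, 0)),   # x1 AND x2
    ({0: 0}, (0, -1, 0)),         # NOT x1
    ({3: 1}, (1, 0, 0)),          # x4
]
default = (0.2, 0.3, 0.5)

first = classify(rules, default, (1, 1, 1, 1))   # votes sum to (2, -1, 0)
second = classify(rules, default, (0, 1, 1, 0))  # votes sum to (0, -1, 0)
```

With this toy committee the first observation is assigned the first class directly, while the second one is decided by the default vector among the two tied classes.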

Fig. 1. A simple decision committee.

Let DC denote the whole class of decision committees. DC contains a subclass
which is among the largest to be PAC-learnable; however, this class is less
interesting from a practical viewpoint since rules can be numerous and hard to
interpret. By contrast, another subclass of decision committees presents an
interesting compromise between representational power and interpretability. In
this class, which is used by WIDC, each of the vector components is restricted to
$\{-1, 0, 1\}$ and each monomial is present at most once. This subclass, which
we refer to as $DC_{\{-1,0,1\}}$, contains the example of Figure 1. It suffers the same
algorithmic drawbacks as decision trees and decision lists: the construction
of small formulae with sufficiently high accuracy is hard. In the case of
$DC_{\{-1,0,1\}}$, this difficulty even comes from two sources: the size limitation, and
the restriction over the vectors.

3 Overview of WIDC 

An earlier work proposes an algorithm, IDC, for building decision committees,
which proceeds in two stages. The first stage builds a potentially large subset of
different rules, each of which consists of a $DC_{\{-1,0,1\}}$ with only one rule. Then,
in a second stage, it gradually clusters the decision committees, given that the
union of two $DC_{\{-1,0,1\}}$s with different rules is still a $DC_{\{-1,0,1\}}$. At the end
of this procedure, the user obtains a population of DCs, from which the most
accurate one is chosen and returned. Results showed that IDC is efficient at
building small DCs. In this paper, we provide an algorithm for learning decision
committees which has a different structure, since it builds only one DC; it shares
common features with algorithms such as C4.5, CART and ICDL. More precisely,
WIDC is a three-stage algorithm. It first builds a set of rules derived from results
on boosting decision trees. It then calculates the vectors using a scheme derived
from ranking loss boosting [22]. It finally prunes the resulting $DC_{\{-1,0,1\}}$ using
one of two possible schemes: pruning using local convergence results and formulae
established for decision tree pruning, which we refer to as “optimistic pruning”,
or a more conventional pruning which we refer to as “pessimistic pruning”. The
default vector is always chosen to be the observed distribution of ambiguously
classified examples.

While it shares many common features with the “C4.5-family” of induction
algorithms, WIDC is much different from the algorithms belonging to the
so-called “AQ-family”. At least two fundamental parts of the induction
are concerned. First, the goal itself is different: WIDC directly grows an overall
concept for all classes instead of building a set of rules for each. Second, the main
induction step, growing a single rule, is conducted from an empty, unconstrained
rule, instead of being driven from a single observation.

3.1 Building a Large DC Using Partition Boosting 

Suppose that the hypothesis (not necessarily a DC, it might be e.g. a decision
tree) we build realizes a partition of the domain $\mathcal{X}$ into disjoint subsets
$X_1, X_2, \ldots$ Fix as $[\![\pi]\!]$ the function returning the truth value of a
predicate $\pi$. Define

$$W_+^{j,l} = \sum_{(o, c_o) \in LS} w((o, c_o)) \, [\![ (o, c_o) \in X_j \wedge c_o = l ]\!] ,$$

$$W_-^{j,l} = \sum_{(o, c_o) \in LS} w((o, c_o)) \, [\![ (o, c_o) \in X_j \wedge c_o \neq l ]\!] .$$

In other words, $W_+^{j,l}$ represents the fraction of examples of class $l$ present in
subset $X_j$, and $W_-^{j,l}$ represents the fraction of examples of classes $\neq l$ present
in subset $X_j$. According to [22], a weak learner should optimize the criterion:

$$Z = 2 \sum_j \sum_l \sqrt{W_+^{j,l} \, W_-^{j,l}} .$$
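
As an illustration, the criterion can be computed directly from a weighted sample and a function that maps each observation to its subset of the partition. The sketch below assumes the form $Z = 2 \sum_j \sum_l \sqrt{W_+^{j,l} W_-^{j,l}}$ of the partition boosting criterion of [22]; the function name `partition_z`, the 0-based class indices and the toy data are assumptions made for the example.

```python
import math
from collections import defaultdict


def partition_z(examples, block_of, num_classes):
    """Compute Z = 2 * sum_j sum_l sqrt(W+[j,l] * W-[j,l]).

    examples: list of (observation, class_index, weight), classes are 0-based.
    block_of: function mapping an observation to the id j of its subset X_j.
    """
    w_plus = defaultdict(float)    # (j, l) -> weight of class-l examples in X_j
    w_block = defaultdict(float)   # j -> total weight falling in X_j
    for observation, cls, weight in examples:
        j = block_of(observation)
        w_plus[(j, cls)] += weight
        w_block[j] += weight
    z = 0.0
    for j, total in w_block.items():
        for l in range(num_classes):
            plus = w_plus.get((j, l), 0.0)
            minus = max(total - plus, 0.0)   # weight of the other classes in X_j
            z += math.sqrt(plus * minus)
    return 2.0 * z


# Toy sample: one Boolean variable that determines the class exactly.
examples = [((0,), 0, 0.25), ((0,), 0, 0.25), ((1,), 1, 0.25), ((1,), 1, 0.25)]
z_split = partition_z(examples, lambda obs: obs[0], 2)   # two pure subsets
z_trivial = partition_z(examples, lambda obs: 0, 2)      # a single mixed subset
```

A partition into pure subsets drives the criterion to zero, whereas the trivial partition of this balanced sample gives the largest value, which is why the criterion is minimized.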

In the case of a decision tree, the partition is the one which is built at the leaves
of the tree; in the case of a decision list, the partition is the one which is
built at each rule, to which we add the subset associated to the default class.
Suppose that we encode the decision tree in the form of a subset of monomials,
by taking for each leaf the logical-$\wedge$ of all attributes from the root to the leaf.
All monomials are disjoint, and measuring $Z$ over the tree’s leaves is equivalent
to measuring $Z$ over the partition realized by the set of monomials. However, due
to the disjointness property, only $t$ subsets can be realized with $t$ monomials, or
equivalently with a tree having $t$ leaves.

Suppose that we generalize this observation by removing the disjointness
condition over the monomials. Then a number of subsets of order $O(2^t)$ can now
be realized with only $t$ monomials, and it appears that the number of realized
subsets can be exponentially larger using DC than DT. However, the expected
running time is not bigger when using DC, since the number of subsets is in
fact bounded by the number of examples $|LS|$, where $|.|$ denotes the cardinality.
Thus, we may expect some gain in the size of the formula we build, which is of
interest for interpreting the classifier obtained.

Application of this principle in WIDC is straightforward: a large DC is built
by repeatedly growing, in a top-down fashion, a current monomial. In this
monomial, the literal added at the current step is the one which minimizes the
current $Z$ criterion, over all possible additions of literals, and given that the new
monomial does not already exist in the current DC (in order to prevent multiple
additions of a single monomial). When no further addition of a literal decreases
the $Z$ value, a new monomial is created and initialized to the empty monomial,
and then is grown using the same principle. When no further creation of a
monomial decreases $Z$, the algorithm stops and returns the current, large DC with
still empty vectors. In the following step, WIDC calculates these vectors.
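
The growing stage described above can be sketched as a greedy loop. The code below is not the authors' implementation: it is a minimal illustration which assumes Boolean variables, takes the partition realized by a set of monomials to be the grouping of examples by the exact subset of monomials they satisfy, breaks ties between equally good literals arbitrarily, and uses a small tolerance `eps` to decide whether $Z$ has decreased. All names are hypothetical.

```python
import math
from collections import defaultdict


def satisfies(observation, monomial):
    """A monomial is a frozenset of (variable, value) literals."""
    return all(observation[var] == value for var, value in monomial)


def z_value(examples, monomials, num_classes):
    """Z over the partition induced by which monomials each example satisfies."""
    w_plus, w_block = defaultdict(float), defaultdict(float)
    for observation, cls, weight in examples:
        signature = tuple(satisfies(observation, m) for m in monomials)
        w_plus[(signature, cls)] += weight
        w_block[signature] += weight
    z = 0.0
    for signature, total in w_block.items():
        for l in range(num_classes):
            plus = w_plus.get((signature, l), 0.0)
            z += math.sqrt(plus * max(total - plus, 0.0))
    return 2.0 * z


def grow_committee(examples, num_vars, num_classes, eps=1e-12):
    """Greedily grow monomials while Z decreases; vectors are left for later."""
    committee = []
    z_current = z_value(examples, committee, num_classes)
    while True:
        monomial, z_mono = frozenset(), z_current   # start from the empty monomial
        while True:
            used = {var for var, _ in monomial}
            candidates = [monomial | {(var, value)}
                          for var in range(num_vars) if var not in used
                          for value in (0, 1)]
            candidates = [m for m in candidates if m not in committee]
            if not candidates:
                break
            best = min(candidates, key=lambda m: (
                z_value(examples, committee + [m], num_classes), sorted(m)))
            z_new = z_value(examples, committee + [best], num_classes)
            if z_new >= z_mono - eps:               # no literal decreases Z
                break
            monomial, z_mono = best, z_new
        if not monomial or z_mono >= z_current - eps:
            return committee                        # no new monomial decreases Z
        committee.append(monomial)
        z_current = z_mono


# Toy sample: the class is x0 AND x1, uniform weights over the four observations.
examples = [((0, 0), 0, 0.25), ((0, 1), 0, 0.25), ((1, 0), 0, 0.25), ((1, 1), 1, 0.25)]
committee = grow_committee(examples, num_vars=2, num_classes=2)
```

On this toy sample two overlapping single-literal monomials are enough to separate the four observations into pure subsets, which illustrates how non-disjoint monomials realize more subsets than the same number of tree leaves.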

3.2 Calculation of the Rule Vectors Using Ranking Loss Boosting 

[22] have investigated classification problems where the aim of the procedure is
not to provide an accurate class for some observation. Rather, the algorithm
outputs a set of values (one for each class) and we expect the class of the
observation to receive the largest value of all, thus being ranked higher than all
others. The ranking loss represents informally the number of times the hypothesis
fails to rank the class of an observation higher than a class to which it does not
belong. Suppose that each example $(o, c_o)$ is replaced by $c - 1$ new “examples”
$(o, y, c_o)$ where $y \neq c_o$, and renormalize the distribution of these new examples so
that $\forall (o, c_o)$, $w((o, y, c_o)) = w((o, c_o))/(c - 1)$. Thus, the weight of an arbitrary,
newly defined example, $w((o, l_0, l_1))$, is 0 whenever $l_1 \neq c_o$ or $l_0 = l_1$, and equals
$w((o, c_o))/(c - 1)$ otherwise. Take some monomial $t_i$ obtained from the large DC, and
all examples satisfying it. We now work with this restricted subset of examples,
while calculating $v_i$. [22] propose a cost function which we should minimize in
order to minimize the ranking loss. Adapted to our framework where the values
of $v_i$ are constrained to $\{-1, 0, 1\}$, we should find the vector $v_i$ minimizing

$$Z = \sum_{(o, l_0, l_1)} w((o, l_0, l_1)) \, e^{\frac{1}{2} (v_i[l_0 - 1] - v_i[l_1 - 1])}$$

(components of $v_i$ are numbered from 0 to $c - 1$). [22] conjecture that finding
the optimal vector (which is similar to an oblivious hypothesis according to their
definitions) is NP-Hard when $c$ is not fixed, when each example can belong
to more than one class, and when new coefficients are predetermined to “adjust”
the values of $v_i$. In our more restricted case however, the problem admits a
solution which is polynomial-time with respect to $|LS|$, $n$, $c$. In the two following
subsections, we present the algorithm for undetermined $c$, and then an exact and
fast procedure for the two-classes case.

Polynomial Solution for Undetermined c. We calculate the vector $v_i$ of
some monomial $t_i$. We use the shorthands $W_1, W_2, \ldots, W_c$ to denote the sum
of weights of the examples satisfying $t_i$ and belonging respectively to classes
$1, 2, \ldots, c$ (although not really relevant to the purpose of this paper, we could
suppose that each example belongs to more than one class). We want to minimize
$Z$ as proposed above. $Z$ can be rewritten as

$$Z = \sum_{1 \le l_0 < l_1 \le c} Z_{l_0, l_1} ,$$

with

$$Z_{l_0, l_1} = \frac{1}{c - 1} \left( W_{l_1} e^{\frac{1}{2} \Delta_{l_0, l_1}} + W_{l_0} e^{-\frac{1}{2} \Delta_{l_0, l_1}} \right)$$

and

$$\Delta_{l_0, l_1} = v_i[l_0 - 1] - v_i[l_1 - 1] .$$

Given only three possible values for each component of $v_i$, the testing of all $3^c$
possibilities is exponential and time-consuming. But we can propose a polynomial
approach. Suppose without loss of generality that $W_1 \le W_2 \le \ldots \le W_c$;
otherwise, reorder the classes so that they verify this assertion. Then it comes
that $\forall\, 1 \le l_0 < l_1 \le c$, $v_i[l_0 - 1] \le v_i[l_1 - 1]$. Thus, the optimal $v_i$ does not
belong to a set of cardinality $3^c$, but to a set of cardinality $O(c^2)$ which is easy
to explore in quadratic time. The solution when there are two classes is much
more explicit, as we now show.
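
The polynomial search can be illustrated as follows. The sketch assumes the cost $Z = \sum_{l_1} \sum_{l_0 \neq l_1} \frac{W_{l_1}}{c - 1} e^{\frac{1}{2}(v[l_0] - v[l_1])}$ with 0-based class indices, which is one reading of the ranking loss criterion adapted to a single monomial. It enumerates only the vectors that are non-decreasing once classes are sorted by weight, i.e. a block of -1, then a block of 0, then a block of +1. The helper `brute_force_z`, included only to check the result on small inputs, and all names are hypothetical.

```python
import itertools
import math


def ranking_z(class_weights, v):
    """Cost of vector v for the examples satisfying one monomial (c >= 2)."""
    c = len(class_weights)
    return sum(class_weights[l1] / (c - 1) * math.exp(0.5 * (v[l0] - v[l1]))
               for l1 in range(c) for l0 in range(c) if l0 != l1)


def best_vector(class_weights):
    """Search the O(c^2) vectors that are monotone in the class weights."""
    c = len(class_weights)
    order = sorted(range(c), key=lambda l: class_weights[l])  # lightest class first
    best_v, best_z = None, math.inf
    for a in range(c + 1):            # the a lightest classes receive -1
        for b in range(a, c + 1):     # the next b - a classes receive 0, the rest +1
            values = [-1] * a + [0] * (b - a) + [1] * (c - b)
            v = [0] * c
            for rank, cls in enumerate(order):
                v[cls] = values[rank]
            z = ranking_z(class_weights, v)
            if z < best_z:
                best_v, best_z = v, z
    return best_v, best_z


def brute_force_z(class_weights):
    """Exhaustive minimum over all 3^c vectors, usable only for small c."""
    return min(ranking_z(class_weights, v)
               for v in itertools.product((-1, 0, 1), repeat=len(class_weights)))


vector, z_min = best_vector([0.1, 0.5, 0.2, 0.2])
```

Since the cost only depends on differences between components, several vectors can reach the same minimum; the sketch returns the first one found.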

Explicit Solution in the Two-Classes Case. Following classical conventions,
rename the two weights as $W^+$ and $W^-$, representing the fraction of examples from
the positive and negative class respectively, satisfying $t_i$. Due to the lack of space,
we only present the algorithm. The proof is a straightforward minimization of
$Z$ in that particular case.

If $W^+/W^- < 1/(e\sqrt{e})$, then $v_i = (-1, +1)$;

If $1/(e\sqrt{e}) \le W^+/W^- < 1/\sqrt{e}$, then $v_i = (0, +1)$;

If $1/\sqrt{e} \le W^+/W^- < \sqrt{e}$, then $v_i = (0, 0)$;

If $\sqrt{e} \le W^+/W^- < e\sqrt{e}$, then $v_i = (+1, 0)$;

Otherwise, $v_i = (+1, -1)$.

3.3 Pruning a DC

The pruning algorithm is a single-pass algorithm: each rule is tested only once,
from the first rule to the last one. For each rule, a criterion Criterion(.) returns
“TRUE” or “FALSE” depending on whether the rule should be removed or not.
There are two versions of this criterion. The first one, which we refer to as
“optimistic”, is derived from recent work on pruning decision trees.
The second one, more classic and “pessimistic”, is strictly based on observed
error minimization.

Optimistic Pruning. A novel algorithm to prune decision trees has recently been
presented, based on a test over local observed errors. By using Lemma 1 of that
work, we can obtain a similar test for DC, which however seems heuristic w.r.t.
the general convergence results on DT. Its principle is the same: “can we compare,
when testing some rule $(t_i, v_i)$ and using the examples that satisfy the rule, the
errors before and after removing the rule?” Name $\epsilon_{(t_i, v_i)}$ the error before
removing the rule, on the local sample $LS_{(t_i, v_i)}$ of examples satisfying monomial
$t_i$. Denote $\epsilon_\emptyset$ the error after removing $(t_i, v_i)$, still measured on the local
sample $LS_{(t_i, v_i)}$. Then we define the “penalty”

$$\alpha'_{(t_i, v_i)} = \sqrt{\frac{(\mathrm{Set}(t_i) + 2) \log(n) + 2 \log(1/\delta)}{|LS_{(t_i, v_i)}|}} .$$

$\mathrm{Set}(t_i)$ denotes the total number of literals of all other rules in the DC that an
arbitrary example could satisfy. Remark that $\alpha'_{(.,.)}$ is quickly computable, though
it actually represents a combinatorial upper bound of the correct formula which
would follow from Lemma 1. The value of Criterion($(t_i, v_i)$) is therefore
“TRUE” iff $\epsilon_{(t_i, v_i)} + \alpha'_{(t_i, v_i)} \ge \epsilon_\emptyset$.

Pessimistic Pruning. The test simply returns “TRUE” iff the error after pruning
the tested rule is strictly lower than before. Otherwise, the rule is kept.
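
The single-pass pessimistic scheme can be sketched as below. This is an illustration rather than the authors' code: the data layout (a rule as a pair of a dictionary monomial and a vote vector), the function names and the toy committee are assumptions. The weighted error is measured on the whole sample; since removing a rule can only change the vote of examples that satisfy its monomial, the comparison gives the same answer as one made on the local sample.

```python
def classify(rules, default, observation):
    """Sum the vectors of the fired rules; break ties with the default vector."""
    total = [0] * len(default)
    for monomial, vector in rules:
        if all(observation[var] == value for var, value in monomial.items()):
            total = [t + v for t, v in zip(total, vector)]
    best = max(total)
    tied = [j for j in range(len(default)) if total[j] == best]
    return tied[0] if len(tied) == 1 else max(tied, key=lambda j: default[j])


def weighted_error(rules, default, examples):
    """Total weight of the examples misclassified by the committee."""
    return sum(weight for observation, cls, weight in examples
               if classify(rules, default, observation) != cls)


def prune_pessimistic(rules, default, examples):
    """Test each rule once, in order; drop it if that strictly lowers the error."""
    kept = list(rules)
    for rule in list(rules):
        without = [r for r in kept if r is not rule]
        if weighted_error(without, default, examples) < weighted_error(kept, default, examples):
            kept = without
    return kept


# Toy two-class problem: the class is 0 when x0 = 1 and 1 otherwise.
good_rule = ({0: 1}, (1, -1))
noisy_rule = ({1: 1}, (-1, 1))
default = (0.4, 0.6)
examples = [((1, 0), 0, 0.25), ((1, 1), 0, 0.25), ((0, 0), 1, 0.25), ((0, 1), 1, 0.25)]
pruned = prune_pessimistic([good_rule, noisy_rule], default, examples)
```

In this toy run the second rule cancels the vote of the first one on one example, so removing it lowers the observed error and only the first rule is kept.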

4 Experimental Results 

Experiments were carried out using both pruning schemes for WIDC. We
observed that the algorithm obtained poor results with the optimistic pruning,
because the penalty factor was too large, particularly on small datasets. Remark
that the penalty factor tends to zero as $|LS_{(.,.)}|$ tends to $+\infty$. We have therefore
chosen to uniformly resample LS into a much larger sample of 5000 examples, when
the initial LS contained less than 5000 examples. By this, we uniformly resample
each problem so that all mimic domains with identical sizes, with which
reasonable comparisons may be made on pruning.

Table 1 presents some results on datasets, which (except those marked by a
“*”) were taken from the UCI repository of machine learning databases [2].
Experiments were carried out on each domain by averaging results over a ten-fold
stratified cross validation procedure. The results are presented for three types
of execution of WIDC: with optimistic pruning (o), with pessimistic pruning (p),
and without pruning (∅). The terminology used for naming datasets is that of
classical machine learning studies. The database LEDeven is exactly that
of the LED recognition problem (still with 10% noise over the attributes), except
that the 10 classes over the digits are replaced by two (even/odd). Problem
LEDeven+17 is LEDeven with 17 additional irrelevant attributes. Underlined
numbers in error rates indicate better values. We did not underline better
results in sizes, for they are systematically in favor of WIDC (o). Column “Others”
points out previously cited results, relevant to our study.

Table 1. Experimental results using WIDC (least errors for WIDC are underlined
for each problem).

                       WIDC (o)              WIDC (p)              WIDC (∅)
Domain           err(%)  r_DC  l_DC    err(%)  r_DC  l_DC    err(%)  r_DC  l_DC    Others
Balance           22.38   4.1  10.5     13.33  16.4  39.9     14.29  18.7  44.9    32.1 (c)
Breast-W           7.46   1.1   4.5      4.08   5.4  22.8      6.90   7.7  29.3    4.0 (c)
Echo              32.14   1.8   3.9     30.71   1.9   4.3     31.42  24.6  38.8    $32.3_{35.4}$ (a)
Glass2            22.94   6.3  16.9     20.00   9.8  24.5     22.35  11.2  27.1    20.3 (c)
Heart-Statlog     24.07   3.1   8.9     19.26   9.1  32.7     21.85  12.5  40.8    21.5 (c)
Heart-C           22.90   2.9   9.1     21.93   8.4  30.1     25.48  13.3  46.2    $22.5_{52.0}$ (a)
Heart-H           22.67   3.9  10.9     20.33   9.2  28.2     20.00  14.3  43.5    $21.8_{60.3}$ (a)
Hepatitis         20.59   3.4   8.7     19.41   8.0  19.0     15.29  11.4  26.7    $19.2_{34.0}$ (a)
Iris               5.33   1.9   4.6      5.33   2.9   7.1     20.67   3.7   7.9    8.5 (c)
Labor             15.00   2.9   5.0     15.00   3.7   6.6     16.67   3.8   6.7    $16.31_{6.8}$ (d)
Lung              42.50   1.3   3.8     42.50   2.6   7.1     42.50   2.7   7.2    46.9 (e)
LED7              31.09   6.9   8.4     24.77  18.1  24.0     24.73  19.0  25.4    $25.73_{12.2}$ (d)
LEDeven *         23.48   5.4   9.6     16.88   6.1  10.8     34.43  12.1  18.4
LEDeven+17 *      35.64   3.8  10.8     21.78  19.9  55.8     21.88  21.5  59.7
Monk1             15.00   4.1   9.5     15.00   5.2  13.0     15.00   9.4  17.9    $16.66_{5.0}$ (d)
Monk2             24.43   9.0  38.4     21.64  18.6  66.5     31.80  24.8  82.1    $29.39_{18.0}$ (d)
Monk3              3.04   3.6   4.8     10.00   4.9   9.1     12.5    9.3  12.3    $2.67_{2.0}$ (d)
Pima              29.61   2.2   5.9     25.97   9.0  34.7     32.99  22.2  68.9    25.9 (c)
Vote0              6.81   1.9   3.0      8.86   6.6  13.5     10.00   9.5  18.9    $4.3_{49.6}$ (a)
Vote1             10.90   2.0   3.5      9.95   9.5  20.6     12.5   13.6  29.7    $10.89_{6.4}$ (d)
Waveform          30.49   4.8   8.2     23.33   7.8  19.0     20.24  40.1  65.0    $33.5_{21.8}$ (b)
XD6               22.29   5.4  13.0     18.80  10.3  27.0     35.7   37.1  67.1    $22.06_{14.8}$ (f)

• Conventions:

r_DC is the number of rules of the DC,

l_DC is the overall number of literals of the DC (if a literal is present k times, it is
counted k times).

• References for “Others”:

(a) improved CN2 (CN2-POE) version; the subscript indicates the size (number
of literals) of the decision list, the main number indicates the error rate.

(b) decision lists learning algorithm ICDL, notations follow (a).

(c) C4.5’s error.

(d) DC learning algorithm IDC, notations follow (a).

(e) nearest neighbors error.

(f) best reported result on DT induction. The subscript indicates the number
of leaves (equiv. to a number of monomials).

Interpretation of Table 1 by means of error comparisons gives a clear advantage
to WIDC with pessimistic pruning. The results obtained also compare
favourably to previously published results for induction algorithms building
decision trees, decision lists or decision committees. But they are all the more
significant as we compare the sizes obtained for the corresponding errors. For the
“Echo” domain, WIDC with pessimistic pruning beats improved CN2 by two
points, but the DC obtained contains roughly eight times fewer literals than
CN2-POE’s decision list. If we except “Vote0”, on all other problems, we outperform
CN2-POE on both accuracy and size. Finally, on “Vote0”, note that WIDC with
optimistic pruning is slightly outperformed by CN2-POE by 2.51%, but the DC
obtained is fifteen times smaller than the decision list of CN2-POE. On many
problems, such a size reduction would be well worth the slight loss in accuracy,
because we keep much information in very small classifiers. This represents the
main point of our experiments, and we go on to explain why it can be achieved
by DC. Our main argument relies on the compromise between the power of the
formalism and the sizes of the experimental DCs we obtain. Consider the example
of Figure 2, obtained on the noisy problem LEDeven by WIDC with optimistic
pruning. It is a very small DC achieving near-optimal prediction for the problem
(13.04%). A LED is represented using seven binary descriptors. The fourth one is
the one which is “off” on the “9” digit, and the fifth one is the lowest which is
“on” on the “1” digit. Over the ten non-noisy digits, rule one is satisfied by all
odd digits and, among the even digits, by the “4” only (thus making an error).
It is the rule having the highest correlation with the classes. Over the non-noisy
digits, the second rule is only satisfied by the digit “2”. Because the problem has
10% noise in the attributes, this rule has the particularity that any noisy odd
digit that does not satisfy the first rule has a very low probability of satisfying
this rule. In particular, it is much lower than the probability that a noisy even
digit satisfies the second rule. Given the noise in the problem, this second rule
somewhat “corrects” and adjusts the prediction of the first one. However, both
rules can be interpreted independently. Though simple, this example shows how
simple interpretations can be carried out using decision committees.



monomials        even     odd
$\bar{x}_4$        -1       1
$\bar{x}_5$         1      -1
default D        0.55    0.45

Fig. 2. An example of DC obtained on the problem LEDeven.



Consider now the problem XD6. In this problem, each example has 10 binary
variables. The tenth is irrelevant in the strongest sense. The target
concept is a 3-DNF over the first nine variables:
$(x_0 \wedge x_1 \wedge x_2) \vee (x_3 \wedge x_4 \wedge x_5) \vee (x_6 \wedge x_7 \wedge x_8)$.
Such a formula is typically hard to encode using a decision tree. In our
experiments with WIDC (o), we have remarked that the target formula itself is
almost always an element of the classifier built, and the irrelevant attribute is
always absent. On the “Vote0” and “Vote1” domains, which consist in predicting
the classes democrat and republican from a set of poll questions, we also observed
constant patterns in the DC built. In particular, problem “Vote1” was built from
“Vote0” by removing the most informative literal, which by itself achieves
approximately 5% errors. Even for “Vote1”, where classical studies often report
errors over 12%, and almost never around 10%, we observed on most of the runs a
DC containing a rule which could be translated as
“If Not (adoption-of-the-budget-resolution) and el-salvador-aid, then
republican”. With such rules, the pessimistic pruning version of WIDC provided
on average an error under 10%.

5 Conclusion 

In this paper, we have presented a new algorithm for building voting procedures
referred to as Decision Committees. WIDC shares common features with
the “C4.5-family” of induction algorithms, or more generally with the
“Weak-induction” framework, which previously proved useful in building Decision
Trees or Decision Lists. WIDC is a two-stage algorithm. A first
stage uses recent results about Boosting to grow a large DC, which is pruned
in a second stage to obtain a small formula. Experimental results tend to show
that WIDC is able to build small and accurate formulae, whose interpretation
is possible via single rules or directly via subsets of them.
Abstract. In this paper we propose a method for extracting clusters in
a population of customers, where the only information available is the
list of products bought by the individual clients. We use association rules
having high confidence to construct a hierarchical sequence of clusters.
A specific metric is introduced for measuring the quality of the resulting
clusterings. Practical consequences are discussed in view of some
experiments on real life datasets.



1 Introduction 

The essence of clustering in databases is to identify homogeneous groups of
objects based on the values of their attributes. Various clustering techniques have
been proposed. In particular, in the database community several systems have
been introduced for clustering of data in large databases.
One can distinguish two main classes of clustering techniques: partitional and
hierarchical clustering. In partitional clustering, objects are grouped
into disjoint clusters such that objects in a cluster are more similar to each other
than to objects in other clusters. For instance, the well-known K-means and
K-medoid methods determine K cluster representatives and assign each object to
the cluster with its representative closest to the object, in such a way that the
sum of the distances between the objects and their representatives is minimized.
Hierarchical clustering on the other hand is a nested sequence of partitions. In
the bottom-up method, larger and larger clusters are built by merging smaller
clusters, starting from an atomic clustering where each object forms a cluster on
its own. In the top-down method, however, one starts with one cluster containing
all objects and constructs a subdivision of the cluster into smaller pieces.
In this paper we introduce a top-down hierarchical clustering method for finding
clusters in a population of customers, where the only information available is the
list of products bought by the individual clients. The technique as well as the
metric we use are tailored for this specific class of data. However, the technique
also gives satisfactory results when applied to the more general problem of finding
clusters in a set of itemsets consisting of a sequence of binary attributes. Some
theoretical questions for this setup have been addressed elsewhere, such as the
computational complexity (NP-completeness) of possible embeddings in
k-dimensional spaces, and the associated clusterings.

D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 39-|^2l 1999- 
© Springer-Verlag Berlin Heidelberg 1999
The idea is to use association rules for mining clusters. These rules relate groups
of customers, and are of the form “80% of the customers that buy products A, B
and C, buy product D too”. Association rules (cf. [2]) are formalized by means of
implication rules augmented with two parameters which describe their quality:
the support, which measures the frequencies of the products occurring in the rule,
and the confidence, which denotes the strength of the implication. A hierarchical
clustering can be built using the following top-down method. First, association
rules having support above a certain threshold are generated, using the efficient
Apriori technique. Next, the “best” association rule is selected, where
the selection criterion may depend on the number of products occurring on
the left-hand side of the rules, as well as on the confidence and support of the
rules. Finally, a cluster is constructed consisting of all the customers buying
the products that occur in the left-hand side of the rule. The data set is then
modified by removing the elements of that cluster. This procedure is iterated
until a suitable stopping criterion is satisfied. Note that a small threshold for
the support may bias the search towards clusters containing few, but strongly
related clients, whereas a high support threshold allows one to construct larger
clusters sharing fewer products.
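
The top-down procedure just outlined can be sketched as a loop. The code below makes several assumptions that the text leaves open: rules are enumerated by brute force instead of Apriori, the left-hand side is limited to `max_lhs` products, the “best” rule is the one with the highest confidence (ties broken by support, then by the size of the left-hand side), and the loop stops when no rule reaches `min_confidence`. Function names, thresholds and the toy baskets are hypothetical.

```python
from itertools import combinations


def rules_above_support(baskets, min_support, max_lhs=2):
    """Yield (lhs, rhs, support, confidence) for rules lhs => rhs, by brute force."""
    n = len(baskets)
    items = sorted(set().union(*baskets))
    for size in range(1, max_lhs + 1):
        for lhs in combinations(items, size):
            lhs_set = set(lhs)
            covering = [b for b in baskets if lhs_set <= b]
            if not covering:
                continue
            for rhs in items:
                if rhs in lhs_set:
                    continue
                both = sum(1 for b in covering if rhs in b)
                if both / n >= min_support:
                    yield lhs_set, rhs, both / n, both / len(covering)


def cluster_top_down(baskets, min_support=0.2, min_confidence=0.8):
    """Repeatedly pick the best rule and remove the customers buying its left side."""
    remaining = dict(enumerate(baskets))      # customer id -> set of products
    clusters = []
    while remaining:
        rules = [r for r in rules_above_support(list(remaining.values()), min_support)
                 if r[3] >= min_confidence]
        if not rules:                         # assumed stopping criterion
            break
        lhs, rhs, _, _ = max(rules, key=lambda r: (r[3], r[2], len(r[0])))
        members = [cid for cid, basket in remaining.items() if lhs <= basket]
        clusters.append((lhs, rhs, members))
        for cid in members:
            del remaining[cid]
    return clusters, sorted(remaining)


baskets = [
    {"bread", "butter", "jam"}, {"bread", "butter", "jam"}, {"bread", "butter"},
    {"beer", "chips"}, {"beer", "chips", "salsa"}, {"milk"},
]
clusters, unclustered = cluster_top_down(baskets)
```

On these toy baskets the loop first isolates the bread buyers, then the beer buyers, and leaves the last customer unclustered because no rule with positive confidence remains.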

In order to assess the effectiveness of this clustering method, we have conducted 
experiments on benchmark data sets from the literature, as well as on two real 
life datasets. The real life data sets contain different kinds of itemsets: in the 
first data set, items describe a small number of products, whereas in the second 
one items describe many products. The results of the experiments indicate that 
the success of this technique depends on the structure of the items in the data 
sets. If the set of possible products is large, whereas customers buy relatively few 
products, the association rules tend to be of lower confidence. The corresponding 
clustering may not be very informative in this case; however, the clusterings still 
make sense. On the other hand, if the customers buy (relatively) more products, 
for instance if the number of products is small, association rules tend to be 
more reliable — which holds for the implied clustering too. For measuring the 
quality of a cluster and comparing clusters, we introduce a metric for the space 
of customers which takes into account the specific structure of the itemsets. 

The paper is organized as follows. In Section 2 we give some terminology on 
association rules and introduce an appropriate metric. Section 3 is devoted to 
a description of our method. In Section 4 we present some results from experi- 
ments. We conclude with a discussion. 

2 Preliminaries 

In this section we define the terminology and concepts that are used throughout 
the paper. 

2.1 Discovering Association Rules 

Suppose that we have n customers and a set S consisting of m products. Every 
customer i buys a subset C 5 of these products. The only information that 
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is used for extracting regularities about the customers is the collection 5i, 52, 

. . . , Sn- An association rule is a rule of the form TZ ^ T for disjoint subsets TZ 
and T of S, where T ^ tti. Customer i is said to satisfy this rule if and only if 
TZU T C Si- The support of this rule is the number of customers that satisfy 
the rule, divided by the total number of customers. The confidence of a rule is 
the number of customers that satisfy it, divided by the number of clients i with 
TZ C Si (the confidence is set to zero if the denominator is zero) . li TZUT has k 
elements, we say that the association rule has order k. In this paper we restrict 
ourselves to association rules whose right-hand side contains only one element. 
Association rules having both large support and confidence can be constructed 
using simple algorithms. However, if the database is very large, efficient meth- 
ods are necessary (see |2p:-il4ll 4j ). like the well-known Apriori algorithm. This 
algorithm is based on the construction of subsets of S that are present in many 
customers, joining them to find ever larger subsets. These subsets are called 
large k-itemsets, where k denotes their cardinality. Given a minimum threshold 
s% for the support, the algorithm starts by constructing 1-itemsets having 
support greater than or equal to s%. Then a large (k + 1)-itemset is generated by 
merging two k-itemsets having exactly (k - 1) elements in common, and checking 
that its support is greater than or equal to s%. 
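
The level-wise construction just described can be sketched in a few lines of Python. This is only an illustration under our own assumptions: customers are represented as Python sets, the names `support` and `large_itemsets` are hypothetical, the threshold is given as a fraction instead of a percentage, and none of the efficiency measures of real Apriori implementations are attempted.

```python
from itertools import combinations


def support(itemset, baskets):
    """Fraction of customers whose purchases contain every item of `itemset`."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)


def large_itemsets(baskets, min_support):
    """Return a dict {k: set of large k-itemsets}, built level by level."""
    items = set().union(*baskets)
    # Level 1: single products that reach the support threshold.
    level = {frozenset([x]) for x in items
             if support(frozenset([x]), baskets) >= min_support}
    result, k = {}, 1
    while level:
        result[k] = level
        # Two k-itemsets sharing exactly k-1 elements merge into a (k+1)-itemset.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == k + 1}
        # Keep only the candidates that reach the support threshold themselves.
        level = {c for c in candidates if support(c, baskets) >= min_support}
        k += 1
    return result
```

Every candidate is checked against the full list of customers here, which is exactly the cost that the efficient methods mentioned above try to reduce.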

Once all large itemsets are generated, association rules can be easily derived 
as follows: suppose {A, B, C, D} and consequently {A, B, C} are large itemsets, 
then the rule A,B,C => D (note that for simplicity we omit { and }) can be 
derived. Clearly, any k-itemset gives rise to k association rules of order k. 
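
As a small illustration of these definitions, the following Python sketch computes the support and confidence of a rule and lists the k rules with a single-element right-hand side that a k-itemset gives rise to. The function names `rule_stats` and `rules_from_itemset` and the representation of customers as sets are our own assumptions, not part of the method's description.

```python
def rule_stats(lhs, rhs, baskets):
    """Return (support, confidence) of the rule lhs => rhs."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    n_lhs = sum(1 for b in baskets if lhs <= b)            # customers buying the left-hand side
    n_rule = sum(1 for b in baskets if (lhs | rhs) <= b)   # customers satisfying the rule
    support = n_rule / len(baskets)
    # The confidence is set to zero if the denominator is zero.
    confidence = n_rule / n_lhs if n_lhs else 0.0
    return support, confidence


def rules_from_itemset(itemset):
    """All rules (lhs, rhs) of order k = |itemset| with a one-element right-hand side."""
    itemset = frozenset(itemset)
    return [(itemset - {x}, frozenset([x])) for x in sorted(itemset)]


baskets = [{"A", "B", "C", "D"}, {"A", "B", "C"}, {"A", "B", "C", "D"}, {"D"}]
sup, conf = rule_stats({"A", "B", "C"}, {"D"}, baskets)
rules = rules_from_itemset({"A", "B", "C", "D"})
```

In this toy example three of the four customers buy A, B and C, and two of those also buy D, so the rule A,B,C => D has support 2/4 and confidence 2/3.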



2.2 A Metric for the Space of Customers 



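We turn the space of customers (i.e., subsets of S) into a metric space by means 
of the following distance measure. For subsets $\mathcal{R}$ and $\mathcal{T}$ of S we define 

$$d(\mathcal{R}, \mathcal{T}) = \frac{|\mathcal{R} \setminus \mathcal{T}| + |\mathcal{T} \setminus \mathcal{R}|}{|\mathcal{R} \cup \mathcal{T}| + 1}.$$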



In this formula $\setminus$ denotes the set-theoretic difference ($X \setminus Y$ consists of those 
elements from X that are not in Y), and $|X|$ denotes the number of elements of X. 
So the numerator is the number of elements in the symmetric difference of $\mathcal{R}$ and 
$\mathcal{T}$. This is the well-known Hamming distance when the list of products bought 
by a client is characterized by means of a string of bits whose length is equal to 
m, and where the i-th entry is 1 if the customer bought the i-th product, and 0 
otherwise. The denominator in the definition of d is added in order to compensate 
for the size of the two sets. If for instance two customers each bought exactly one 
product that the other did not, their distance is 2/(k + 3), where k is the number 
of products bought in common. So their distance decreases as the number of common 
purchases increases. This allows for judging the distance between customers also in 
terms of the number of products they bought. The +1 in the denominator of the 
formula for the distance d is added to deal with the case $\mathcal{R} = \mathcal{T} = \emptyset$, but may 
also be omitted if one defines $d(\emptyset, \emptyset) = 0$. This approach leads to almost the 
same metric. 
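
The distance is easy to compute directly from two sets of products. The sketch below is a minimal illustration; the function name `distance` and the use of Python sets are our own choices.

```python
def distance(r, t):
    """d(R, T) = |symmetric difference of R and T| / (|union of R and T| + 1)."""
    r, t = set(r), set(t)
    return len(r ^ t) / (len(r | t) + 1)


# Two customers with k = 3 common products and one differing product each:
k = 3
d_example = distance({1, 2, 3, 4}, {1, 2, 3, 5})   # equals 2 / (k + 3)

# The +1 in the denominator makes the empty case well defined.
d_empty = distance(set(), set())
```

With more common purchases the same two differing products weigh less: `distance({1, 4}, {1, 5})` is 2/4, whereas the example above gives 2/6.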
Note that $0 \le d(\mathcal{R}, \mathcal{T}) \le 1 - 1/(m + 1) < 1$ for all subsets $\mathcal{R}$ and $\mathcal{T}$ of S. 
(If necessary the measure can be renormalized by multiplying it by a suitable 
fixed factor.) Of course $d(\mathcal{R}, \mathcal{T}) = d(\mathcal{T}, \mathcal{R})$ for all subsets $\mathcal{R}$ and $\mathcal{T}$ of S, and 
$d(\mathcal{R}, \mathcal{R}) = 0$. Finally the triangular inequality holds: 

$$d(\mathcal{R}, \mathcal{T}) \le d(\mathcal{R}, \mathcal{U}) + d(\mathcal{U}, \mathcal{T})$$

for all subsets $\mathcal{R}$, $\mathcal{U}$ and $\mathcal{T}$ of S; this can be verified by some tedious 
calculations. Indeed, put $|\mathcal{R} \cap \mathcal{U} \cap \mathcal{T}| = a$, $|\mathcal{R} \cap \mathcal{U} \cap \overline{\mathcal{T}}| = b$, and so on; here $\overline{\mathcal{T}}$ 
denotes the complement of the subset $\mathcal{T}$ in S. Now substitute these numbers in 
the inequality to be proved, remove the denominators and carefully check the 
remaining abundance of terms. We may conclude that d is a metric on the space 
of customers. 
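
For small product sets the metric properties can also be checked by brute force. The sketch below does so with exact rational arithmetic; the function names are hypothetical and the check is an illustration, not a proof. As a side remark of our own, not taken from the text: adding one fictitious product that every customer buys turns d into the Jaccard distance, which is known to be a metric, and this gives another way to see why the triangular inequality holds.

```python
from fractions import Fraction
from itertools import combinations, product


def d(r, t):
    """Exact value of the distance between two frozensets of products."""
    return Fraction(len(r ^ t), len(r | t) + 1)


def all_subsets(universe):
    """Every subset of `universe`, as frozensets."""
    items = sorted(universe)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]


def check_metric(m):
    """Check symmetry, identity, the upper bound and the triangular inequality."""
    subsets = all_subsets(range(m))
    bound = 1 - Fraction(1, m + 1)
    for r, t in product(subsets, repeat=2):
        assert d(r, t) == d(t, r)
        assert (d(r, t) == 0) == (r == t)
        assert 0 <= d(r, t) <= bound
    for r, u, t in product(subsets, repeat=3):
        assert d(r, t) <= d(r, u) + d(u, t)
    return True
```

Calling `check_metric(4)` examines all 16 subsets of a four-product set, that is 4,096 triples, which takes a fraction of a second.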

A cluster can be defined as a set of customers that are closer to each other 
than to customers in other clusters with respect to the distance d. Note that the 
construction of clusterings is biased towards customers buying many products. 
Therefore, when we encode the information of a customer as a string of bits, 
strings containing many ones are more liable to form a cluster. Also notice that 
the measure can be used for classification tasks, where the class of an item is 
defined to be the nearest cluster. 

3 Association Rules Infer Clusters 

Suppose that we have association rules of order 1, 2, 3 and so on. Now fix a 
minimum support threshold of say s%. We consider the association rules having 
highest possible order, since they represent dependency among a larger set of 
products. These rules can be obtained by considering only the largest k-itemsets 
generated by the Apriori algorithm. The minimum support s has to be rather 
small for ensuring the existence of these rules. However, s should not be too 
small in order to avoid the generation of many rules which are satisfied by few 
customers. In the experiments we have conducted, the values of s have been 
chosen after tuning the algorithm on each specific dataset. 

Once the association rules of highest order have been generated, we select the one 
with the highest confidence; if there exist more rules attaining this maximum, 
we choose one of them in a random way. We refer to the selected rule as rule_1 = 
($\mathcal{R}_1 \Rightarrow \mathcal{T}_1$). Now all customers that bought all products in $\mathcal{R}_1$ constitute cluster_1. 
This means that the cluster includes not only the customers that satisfy the 
rule, but also those customers that bought all products from $\mathcal{R}_1$ but not those 
from $\mathcal{T}_1$. Because we consider rules having high confidence, it is expected that 
these extra customers are similar to those satisfying the rule. Next, we remove 
the customers occurring in cluster_1 from the original dataset. The process is 
iterated a suitable number of times, leading to a hierarchical clustering cluster_1, 
cluster_2, cluster_3, ... The termination condition we have used in the experiments 
consists of stopping when either a maximum number of clusters is generated (this 
maximum is given as an input parameter) or when the generated association rules 
do not reach the minimum support threshold. 

The algorithm is illustrated below: 

1  s := minimum support; i := 1; 
2  Data := { all objects in the dataset }; Clust := ∅; 
3  while (not termination-condition) do 
4      H := { association rules over Data 
              having maximum order and support ≥ s }; 
5      best := rule from H with highest confidence; 
6      cluster_i := { objects containing products in LHS(best) }; 
7      Data := Data \ cluster_i; Clust := Clust ∪ {cluster_i}; 
8      i := i + 1; 
9  od 

Here LHS(best) denotes the set of elements on the left-hand side of the 
association rule best. The core of the algorithm consists of the statements in lines 4 
and 5. In line 4 the association rules of maximum order are generated, by 
considering the objects in the dataset Data. Note that Data is initially equal to the 
original dataset, but it becomes smaller and smaller at each iteration (line 7). 
The generated cluster is inserted into the current clustering Clust (assuming that 
this is an ordered set with respect to insertion) and the process is repeated on 
the smaller dataset Data obtained by removing the objects within the cluster. 
In this way we obtain a partitioning of the set of customers into clusters. As the 
results of the experiments will show, the sequence of clusters has the property 
that the first clusters that are built are of good quality, while clusters that are 
generated later on may become less informative. The quality of the cluster is 
here only determined by considering the average distance of its elements and 
the confidence of the corresponding association rule. In this study we do not 
take into account other measures like the cluster diameter, i.e., the maximum 
distance between any two points of the cluster. 
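
The loop above can be turned into a compact Python sketch. Several choices here are our own assumptions and not statements of the method: the support is measured against the original number of customers, ties in confidence are broken deterministically instead of randomly, itemsets are found by a naive level-wise search, and all function names are hypothetical.

```python
from itertools import combinations


def max_order_itemsets(data, min_support, n_total):
    """Large itemsets of the highest order present in `data`."""
    def sup(s):
        return sum(1 for b in data if s <= b) / n_total
    items = set().union(*data) if data else set()
    level = {frozenset([x]) for x in items if sup(frozenset([x])) >= min_support}
    best, k = set(), 1
    while level:
        best = level
        cands = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        level = {c for c in cands if sup(c) >= min_support}
        k += 1
    return best


def best_rule(data, min_support, n_total):
    """Rule (lhs, rhs, confidence) of maximum order with the highest confidence."""
    rules = []
    for itemset in max_order_itemsets(data, min_support, n_total):
        n_rule = sum(1 for b in data if itemset <= b)
        for x in sorted(itemset):
            lhs = itemset - {x}
            n_lhs = sum(1 for b in data if lhs <= b)
            conf = n_rule / n_lhs if n_lhs else 0.0
            rules.append((conf, sorted(lhs), x))
    if not rules:
        return None
    conf, lhs, x = max(rules)          # deterministic tie-breaking (assumption)
    return frozenset(lhs), x, conf


def mine_clusters(baskets, min_support, max_clusters):
    """Return (clusters, unclustered customers)."""
    data = [frozenset(b) for b in baskets]
    n_total = len(data)
    clusters = []
    while data and len(clusters) < max_clusters:
        rule = best_rule(data, min_support, n_total)
        if rule is None:               # no rule reaches the support threshold
            break
        lhs, rhs, conf = rule
        members = [b for b in data if lhs <= b]    # everybody buying LHS(best)
        data = [b for b in data if not lhs <= b]   # Data := Data \ cluster_i
        clusters.append({"rule": (lhs, rhs), "confidence": conf, "members": members})
    return clusters, data
```

A rule with an empty left-hand side is allowed in this sketch, in which case the cluster absorbs all remaining customers.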

4 Experiments 

For the experiments we used the so-called “Zoo Database”, the artificial “LED 
Database” (both available from the Internet, see EHI), and two real life datasets 
generated from actual shop sales. The first two data sets are used for illustrating 
the effectiveness of our simple method in the case of classification problems. 
The other two datasets are used for illustrating the usefulness of the method for 
finding regularities in larger real life datasets — our original goal. 

4.1 The Zoo Database 

The zoo database contains 15 boolean attributes for n = 101 animals. In 
addition, there is one six-valued numeric attribute: the number of legs. For our 
algorithm to work it was necessary to turn this attribute into a boolean one, 
either by stating that “legs > 2 is equivalent to True” (or to False) or by 
introducing a boolean attribute for every possible numeric value. Here we choose 
the second option, the first one leading to almost identical results. So we have 
m = 21 “products”: the terminology customer-product has to be interpreted as 
animal-attribute in this section. The original dataset also contains a classifica- 
tion of the animals into seven classes, referred to as A (41 mammals), B (20 
birds), C (5 reptiles), D (13 fishes), E (4 amphibians), F (8 insects) and G (10 
molluscs). The mean distance within the entire dataset is 0.577, which indicates 
that the attributes have several dependencies. In our experiments we discard 
information about the class to which an animal belongs. 

Fig. 1. Experimental results for the zoo database. 



Figure 1 shows the results of some experiments. We considered three runs with 
different minimum threshold for the support: 4%, 10% and 40%. The maximum 
number of clusters to search for was set to 10. In the column “class contents” we 
mention the classes of the animals in the clusters (between brackets the number 
of animals of each class within the cluster). We also included the mean Hamming 
distance, since it might be a better measure for this database — the number of 
ones being relatively high. 

The first and second run show that the classes are well separated; if the mean 
distance within the clusters gets higher, more classes may occur in the same 
cluster. The hierarchical nature of the clustering is also apparent. Note that 
when the support threshold is high, not all items are clustered in the end: not 
even rules of type “=> attribute” reach this threshold anymore, just because the 
number of remaining animals is too small. The animals not clustered are exactly 
those from classes F and G. It would of course be possible to lower the threshold 
during the run, thus giving smaller clusters the opportunity to be discovered. 
The rules found were of high order and confidence; for cluster 1 in the first and 
second run (it happens to be the same rule) the rule has order 9, for cluster 2 
it has order 8 and 7, respectively. For the third run the first rule has order 3: 
“milk, breathes => backbone”; the second rule is: “backbone => eggs”, so the 
animals not having a backbone remain unclustered. The mammal in cluster 6 of 
the second run is a platypus, the two mammals in cluster 7 are a dolphin and a 
porpoise; both classifications make some sense, at least for a non-biologist. The 
mean distances within the clusters are very small, see for instance cluster 3 in 
the first run. This also reveals some of the nature of the database. 



4.2 The LED Database 

The LED database is an artificial database, where each item corresponds with 
the “seven bit LED encoding” of one of the ten numbers 0, 1, 2, ..., 9. Noise 
is introduced by appending at the end of the original string (with seven bits) 
a sequence of fixed length consisting of randomly chosen bits, as well as by 
corrupting some of the bits in the original string. The LED encoding shows 
which LEDs out of seven are on in each case, for instance the topmost horizontal 
LED is activated for the numbers 0, 2, 3, 5, 7, 8 and 9. This database is particularly 
interesting, since the noise may be added in such a way that there are lots of 
zeroes, a property also present in the real life data sets we use in the sequel. 
As a typical example (see Figure 2) we added 93 random bits, giving a total of 
m = 100 bits; these extra bits had a 90% probability of being zero. We generated 
n = 1000 strings. The first experiment shows a situation where the original seven 
bits are not corrupted, whereas in the second run these bits had a 7% chance 
of being toggled (so on average 50% of the encoded strings contains a flaw). In 
the column “class contents” the classes of the strings, i.e., their numbers, in the 
clusters are given (between brackets the number of strings of each class within 
the cluster); a * denotes elements different from the majority within the class. 
Observe that some encodings are quite similar (for instance those of 0 and 8), 
giving understandable faults in case of noise. 

Fig. 2. Experimental results for the LED database. 

The mean distance within the entire dataset is 0.764 and 0.778, respectively. The 
mean distance within the clusters is easily understood for the first experiment: 
the clustered strings may only differ in the 93 last bits, 9.3 of which are 1 on 
average. The association rules found are of high order, due to the exact nature of 
the database. The ordering of the clusters is as expected: note that the fact that 
our algorithm is biased towards ones implies that encodings with many zeroes 
(e.g., that for the number 1) are only detected in the end. We may conclude 
that the algorithm is capable of discovering clusters that respect the original 
classification, as it did for the zoo database. 

4.3 Two Real Life Databases 

In contrast with the previously discussed databases, the real life datasets consid- 
ered here contain more customers, who may choose from many products. They 
are expected to show a less regular behaviour. Also, the division into classes is 
not known in advance. In all cases two or three of the biggest selling products 
were removed, since they do not contribute to the generation of interesting as- 
sociation rules. For example, in one of the datasets about 50% of the customers 
bought one particular product; this product is very likely to occur in a good 
association rule, but probably has not much discriminating ability. 

The first database has m = 7,500 products, whereas the number of customers is 
of moderate size; we experimented with a subset consisting of n = 800 customers, 
buying 50 to 95 products each (sample D1), and a subset consisting of n = 1,400 
customers, buying 8 to 10 products each (sample D2). For D1 the mean number 
of products bought was 57.05, and the mean distance was 0.975, which is very 
high. For D2 these numbers were 8.36 and 0.939, respectively. 

The second database has m = 100 products, and we considered n = 10,000 
customers (sample D3). It also contained purer data, i.e., there were fewer flaws 
present, probably because the products involved were more expensive. The mean 
number of products per customer was 2.11, with mean distance 0.720. Sample 
D4 consisted of n = 10,000 other customers from the same database; here the 
mean number of products per customer was 2.35, with mean distance 0.733. 

In all cases the support threshold was taken to be 2%, and the number of clusters 
to find was bounded by 10. Only for D2 the support threshold was 0.5%; larger 
values did not provide any significant clustering in that case. The results of the 
experiments are reported in Figure 3 and Figure 4. Except for the computation 
of the mean distance the runs took only a few minutes on a Pentium-based PC. 
The products that occur in the rules are arbitrarily named s1, s2, ..., s19 and 
t1, t2, ..., t13 for the first database, and p1, p2, ..., p13 for the second one. 
Note that the products in the rule for cluster_3 (sample D3) form a subset of 
those from cluster_1. In fact, in this database the rule p2 => p3 holds with high 
reliability. Since the algorithm tries to find association rules of the highest 
possible order first, having fixed support threshold, this rule is superseded by the 
rule p1, p2 => p3 that constitutes cluster_1. In this case two separate clusters are 
found. If the support threshold were such that it was not met by p1, p2 => p3, 
only one cluster, based on p2 => p3, would have resulted. This shows that human 
interference plays a crucial role in the process: the choice of the support 
threshold influences the clustering. In contrast, the triples {p1, p2, p3} and {p4, p5, p6} 
show some differences; the rule p5 => p6 has low confidence, and the clustering 
concerning these three products is not as clear as the one for {p1, p2, p3}. 

Fig. 3. Experimental results for the first real life database. 







Walter A. Kosters, Elena Marchiori, and Ard A.J. Oerlemans 



sample  cluster  number of  mean      rule                  confidence
                 customers  distance
D3      1           253     0.484     p1, p2 => p3           98%
        2           337     0.443     p4, p5 => p6           89%
        3           431     0.411     p2 => p3               99%
        4           320     0.372     p7 => p8               91%
        5           370     0.473     p9 => p10              72%
        6          2102     0.388     p5 => p6               67%
        7          6187     0.679     => p11                 20%
D4      1           216     0.455     p4, p5, p12 => p6      96%
        2           202     0.557     p2, p13 => p3         100%
        3           312     0.428     p1, p2 => p3           98%
        4           485     0.424     p4, p5 => p6           90%
        5           376     0.450     p7 => p8               93%
        6           392     0.426     p9 => p10              85%
        7          1253     0.446     p6 => p5               63%
        8          6764     0.676     => p11                 26%



Fig. 4. Experimental results for the second real life database. 



The last clusters do not seem to be of any importance. This holds in particular 
for cluster_7, resp. cluster_8, which resulted from a rule with empty left-hand side 
and consequently very low confidence. Also note the relatively high mean 
distance. No association rules of order 2 were present anymore, and the algorithm 
clusters all the remaining customers (remember that all customers buying the 
left-hand side, which is empty here, are clustered). Domain experts were 
capable of interpreting the most significant clusters. The experiments with many 
customers show higher coherence within the clusters, reflected by lower mean 
distance and higher confidence. In all cases the rules found had low order, also 
due to the abundance of products to choose from. But even for the case with 
fewer customers and fewer products per customer the clusters found made sense. 

5 Conclusions 

In this paper we have proposed a simple method for mining clusters in large 
databases describing information about products purchased by customers. The 
method generates a sequence of clusters in an iterative hierarchical fashion, using 
association rules for biasing the search towards good clusters. We have tested 
this method on various datasets. The results of the experiments indicate that 
the technique allows one to find informative clusterings. 

As already mentioned, due to the hierarchical strategy employed to generate the 
clustering, it may happen that the cluster generated in the last iteration contains 
objects sharing few regularities. In this case, one can discard the last cluster and 
consider its objects as not belonging to the clustering, because they do not 
present enough regularities. Alternatively, one can redistribute the elements of 




Mining Clusters with Association Rules 






the last cluster among the other clusters. For instance, a possible redistribution 
criterion can be the distance of the objects from the clusters, where an object is 
inserted in the cluster having minimal distance. More sophisticated techniques 
for redistributing objects of the last cluster can also be applied (e.g., 03). 
Clustering techniques have been studied extensively in the database community, 
yielding various systems such as CLARANS (Q2I) BIRCH (Q7|). These systems 
are rather general: they apply techniques imported from clustering algorithms 
used in statistics, like in CLARANS, or sophisticated incremental algorithms, like 
in BIRCH. It is not our intention to advocate the use of our clustering algorithm as 
an alternative for such systems. Nevertheless, our clustering algorithm provides 
a simple tool for mining clusters in large databases describing sales data. 
Several techniques based on association rules have been proposed for mining var- 
ious kinds of information. However, to the best of our knowledge, our method 
provides a novel use of association rules for clustering. Some related techniques 
based on association rules are the following. In 0 an algorithm for finding profile 
association rules is proposed, where a profile association rule describes associa- 
tions between customer profile information and behaviour information. In H21 
association rules containing quantitative attributes on the left-hand side and 
a single categorical attribute on the right-hand side are considered. A method 
for clustering these rules is introduced, where rules having adjacent ranges are 
merged into a single description. This kind of clustering provides a compact rep- 
resentation of the regularities present in the dataset. Finally, in 0 the use of 
association rules for partial classification is investigated, where rules describing 
characteristics of some of the data classes are constructed. The method gener- 
ates rules which may not cover all classes or all examples in a class. Moreover, 
examples covered by different rules are not necessarily distinct. 

In this paper, we restricted ourselves to a specific type of datasets where the 
objects are vectors of binary attributes. We intend to investigate the effectiveness 
of the clustering method when multivalued attributes as well as quantitative 
ones are used: this amounts to considering more expressive forms of rules, like 
for instance the so-called profile association rules (see 0). 

An interesting topic for future work is the analysis of the integration of our 
technique into more sophisticated clustering systems. For instance, we would 
like to analyze the benefits of our clustering algorithm when used for generating 
a “good” initial clustering of the data that could be subsequently refined, either 
by means of iterative methods in the style of those from 0, or by means of 
methods based on evolutionary computation like genetic algorithms. 

Abstract. If knowledge can be gained at the pre-processing stage, 
concerning the approximate underlying structure of large databases, it can 
be used to assist in performing various operations such as variable 
subset selection and model selection. In this paper we examine three 
methods, including two evolutionary methods, for finding this approximate 
structure as quickly as possible. We describe two applications where the 
fast identification of correlation structure is essential and apply these 
three methods to the associated datasets. This automatic approach to 
the searching of approximate structure is useful in applications where 
domain specific knowledge is not readily available. 



1 Introduction 

If knowledge can be gained at the pre-processing stage, concerning the approx- 
imate underlying structure of very large databases, it can be used to assist in 
performing various operations, e.g. reducing the dimensionality of the database 
through the selection of variables for further analysis or grouping the variables 
into closely related subsets. In some applications it would be useful to gain this 
knowledge as quickly as possible, for example if the procedure is part of a real 
time application or if the dataset is so large that a full analysis could take 
months. 

One way to find the structure in a dataset is correlation analysis. In this 
paper we experiment with three methods in finding the approximate correlation 
structure and compare these with an exhaustive search method for verification. 
In this section we describe two applications and corresponding datasets where 
the fast identification of correlation structure is essential. Section 2 describes two 
methods for calculating correlations, one being a linear method and the other be- 
ing able to calculate non-linear correlations. Section 3 describes various methods 
for performing fast, approximate search including our version of an evolutionary 
programming algorithm. Section 4 presents our results on the two datasets using 
the different algorithms and correlation coefficients. Finally, section 5 discusses 
the results and future work. 

D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 1999. 

1.1 Oil Refinery Process Data 

Many complex chemical processes record multivariate time-series data every 
minute. This data will be characterised by a large number of interdependent 
variables (in the order of hundreds per process unit) . There can be large time 
delays between causes and effects (over 120 minutes in some chemical processes) 
and some variables may have no substantial impact on any others. Correlations 
can change within the system depending on how the process is being controlled. 
If we want to perform diagnosis automatically (in as close to real time as pos- 
sible) a method to learn the current control structure would be required, which 
is calculated from the most recent data. In these situations sampling would be 
unsuitable because of the changes in control structure. One dataset used in this 
paper is from a Fluid Catalytic Cracker (FCC) P] and has a total of approxi- 
mately 300 variables. A dataset of this size would prevent the exhaustive search 
from being used as a comparison for other methods. Consequently steps were 
taken to use a sufficiently small dataset: 31 variables have been selected from 
the data containing approximately 10,000 data points. 



1.2 Visual Field Data 

The second data set is a section of Normal Tension Glaucoma (NTG) Visual Field 
(VF) Data. Glaucoma 0 is the name given to a family of eye conditions. The 
common trait of these conditions is a functional abnormality in the optic nerve, 
leading to a loss of visual field. This vision loss usually occurs only in part of the 
visual field; however, untreated glaucoma can lead to blindness. Once diagnosed, 
a patient undergoes frequent outpatient appointments where their visual field 
is tested. The forecasting of a patient’s visual field is important in order to 
diagnose, monitor and control the progression of glaucoma. Correlation between 
points at different time lags can play a useful role in the monitoring of the disease 
progression, since many mathematical methods for time-series forecasting need 
the correlations between variables to complete the models. It would be useful to 
be able to do this during a patient’s regular consultation so that any decisions 
could be made while they wait, hence in as short a time as possible. The visual 
field dataset consists of 82 patients’ right eyes measured approximately every six 
months for between five (a time-series length of ten) and 22 years (a time-series 
length of 44). The particular test used for this dataset results in 76 points being 
measured, which correspond to a 76 variable time-series. 



2 Background 

We are interested in developing an algorithm that finds a “good-but-not-optimal” 
selection of “interesting” highly correlated variables in as short a time as possi- 
ble. The term interesting will, of course, depend on the application in question. 
For example, only cross-correlations will be of interest in the FCC domain. The 
number of correlations to be located would depend on the context in which the 
method is being used (from now on we shall refer to this parameter as the rank 
size). 

Correlation analysis is a way to measure how “coupled” two or more vari- 
ables are. Although this is not a reliable method with which to infer causality 
amongst variables, it can be useful in determining the underlying structure of a 
database. The correlation coefficient falls between the values -1 and +1, where 
-1 shows a strong negative correlation and +1 shows a strong positive one. There 
exist various methods for calculating how correlated two variables are, the most 
common being Pearson’s Correlation Coefficient. 



Pearson’s Correlation Coefficient (PCC) ^21 measures the linear relation- 
ship between two continuous variables x and y. 



$$\rho_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y}, \quad \text{where } -1 \le \rho_{xy} \le 1 \qquad (1)$$

$$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{n - 1} \qquad (2)$$

Where $\rho_{xy}$ is the value of PCC between x and y, n is the number of x, y pairs, 
$\sigma_x$ and $\sigma_y$ are the standard deviations for x and y respectively, $x_i$ and $y_i$ are 
the ith instances of the variables x and y, and $\mu_x$ and $\mu_y$ are the expectations 
of the variables x and y. The calculation is, therefore, linear in computation time. 
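
Equations (1) and (2) translate directly into code. The following is a minimal sketch in plain Python, assuming that x and y are equally long lists of numbers with non-zero variance; the function name `pearson` is our own choice. Each sum is a single pass over the data, which is where the linear running time comes from.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Equation (2): sample covariance.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    # Sample standard deviations, using the same n - 1 normalisation.
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    # Equation (1); a constant sequence would give a division by zero here.
    return cov / (sd_x * sd_y)
```

For a perfectly linear relationship such as `pearson([1, 2, 3, 4], [2, 4, 6, 8])` the result is 1 up to rounding, and reversing one of the sequences gives -1.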



Spearman’s Rank Correlation (SRC) m measures non-linear relationships 
between two variables, either discrete or continuous, by assigning a rank to each 
observation. It then incorporates the sums of the squares of the differences in 
paired ranks $(d_i)^2$ according to the formula: 

$$R_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \qquad (3)$$

Where $R_s$ is the value of SRC between the two variables, n is the number 
of pairs and each $d_i$ is calculated by taking the difference between the ranks of 
each variable pair $x_i$ and $y_i$. This means the calculation is of the order n log(n), 
since sorting must be used on data that is not already ranked. 
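
A short Python sketch of equation (3) follows. It assumes that neither variable contains tied values, since the formula in this form does not handle ties; the helper names `ranks` and `spearman` are hypothetical. The sorting step in `ranks` accounts for the n log(n) cost.

```python
def ranks(values):
    """Rank of each observation (1 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rank correlation via the sum of squared rank differences."""
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

Because only ranks are used, a monotone but non-linear relationship such as `spearman([1, 2, 3, 4, 5], [1, 8, 27, 64, 125])` still yields 1.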



Time-Series. Time-series data is a collection of observations made sequentially 
in time [^. If one item is recorded at each time point, then the time-series is 
referred to as univariate. If more than one observation is made at each time 
point, then the series is multivariate. The medical, financial and process sectors 
are full of examples of time-series data sets, both univariate and multivariate. 

If the data to be analysed is a time-series then the Cross-Correlation Func- 
tion (CCF) can be used to explore how correlated two series are over differing 
time lags, therefore indicating the direction of influence. The Auto-Correlation 
Function (ACF) measures how closely correlated a variable is with itself over 
varying time lags. Both CCF and ACF may use one of the two correlation 
coefficients described above. For time-series in which the lags are large, many different 
coefficients must be calculated. If V is the number of variables and lag is the 
time lag that is under consideration, then the number of possible correlations 
is $lag \times V^2$. There may be various real-world applications where the number of 
possible correlations may pose a problem, for example, where the structure of 
a dataset would be required in real time as the data is produced, or where a 
dataset is so huge that it would take an unreasonable amount of time to process. 
The two real datasets described in the previous section illustrate these scenarios. 
Therefore, for the analysis of time-series with high dimensionality or large time 
lag influences, there is a need for fast approximate searches over the number of 
possible correlations. 
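
To make the notion of a lagged correlation concrete, the sketch below correlates one series with a shifted copy of another. It is an illustration under our own assumptions: the text does not fix a sign convention for the lag, so we take x at time t against y at time t + lag; Pearson's coefficient is used, although a rank-based coefficient could be substituted; and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def cross_correlation(x, y, lag):
    """Correlation between x[t] and y[t + lag] for a lag >= 0 (assumed convention)."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    if lag < 0 or lag > len(x) - 2:
        raise ValueError("lag must leave at least two overlapping points")
    # Drop the last `lag` points of x and the first `lag` points of y.
    return pearson(x[:len(x) - lag], y[lag:])


x = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
y = [5.0, 7.0] + x[:-2]            # y repeats x two time steps later
peak = cross_correlation(x, y, 2)  # close to 1 at the true delay
```

Evaluating this for every ordered pair of variables and every lag is what makes the number of coefficients grow so quickly.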

For the FCC data, if the goal is to find a subset of highly correlated variables 
then auto-correlations will be considered irrelevant. For this reason only cross- 
correlations are explored. An assumption is made that the maximum time lag 
from cause to effect will be 60 minutes. Within the VF dataset, for each patient’s 
visual field data, correlations will be calculated using the 76 visual field points 
at a time lag of up to five (30 months). Since the length of each time-series that 
makes up the VF dataset is significantly shorter than that in the FCC dataset, 
all of the 82 available patients are used to make the problem comparable in 
complexity. The combined correlations for all of the patients were averaged to get 
a single value for two points at a given time lag so that the general dependencies 
over a representative population can be compared with the medical literature 
m for verification. 

3 Methods 

This section describes the methods employed to find subsets of highly correlated 
variables through time. All of these methods use the absolute value of the corre- 
lation coefficients, in order to rank a relationship between 0 and 1 inclusive (the 
objective was to locate dependencies but not their nature). Four methods were 
implemented: Random Bag (RB), Genetic Algorithm (GA) pi], Evolutionary 
Programming (EP) [ Il5^ti] and an Exhaustive Search. These are described in the 
remainder of this section and all make use of a triple, (x, y, lag), to represent the 
correlation between variables x and y with a time lag of lag. It has proved hard 
to find any existing methods that do not rely on the data being categorical, for 
example m- Note that the standard statistical solution would be to explore the 
whole search space, sample the time-series, or restrict the search space through 
the use of expert knowledge; all of which are inappropriate for the applications 
used in this paper. 
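
As an illustration of how a single triple (x, y, lag) can be scored, the following minimal Python sketch computes the absolute Pearson correlation between one series and a lagged copy of another. The function name, the list-based representation, and the convention that the second series is shifted forward by lag samples are assumptions made for this example, not details taken from the paper.

```python
import math


def lagged_abs_pearson(xs, ys, lag):
    """Return |Pearson correlation| between xs[t] and ys[t + lag], lag >= 0.

    The pairing convention (ys shifted forward by `lag`) is an assumption.
    """
    a = xs[:len(xs) - lag] if lag else xs
    b = ys[lag:]
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant series: treated as uncorrelated (an assumption)
    return abs(cov / math.sqrt(var_a * var_b))


# ys follows xs one step later with the sign flipped, so the absolute
# correlation at lag 1 is as large as it can be.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, -1, -2, -3, -4, -5]
score = lagged_abs_pearson(xs, ys, 1)
```

Taking the absolute value means that a strong negative dependency ranks as highly as a strong positive one, which matches the stated objective of locating dependencies rather than their nature.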

The Exhaustive Search. This method was performed on both data sets using 
Pearson’s and Spearman’s rank coefficients. Although in practice datasets could 
be of sufficiently large dimensionality and time-series length to preclude such a 
search, steps were taken to use sufficiently small datasets, so that the results 
could be used as a benchmark for the other methods. The exhaustive search 
consisted of simply exploring all of the variables, at each time lag. At time lag
zero, the correlations (x, y, 0) and (y, x, 0) are effectively the same so duplicates
are removed. The correlations (x, x, 0) are all one, and hence these are removed
too. This would result in a total of 31,730 correlations for the VF data, or
2,601,860 if all 82 patients’ data are used. With the FCC data all correlations
of the form (x, x, lag) are removed since these are auto-correlations and do not
show relationships between different variables. This will mean a total number of
55,335 different cross-correlations. This strategy of correlation removal was also
performed in the other three methods. 
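
The VF totals quoted above can be checked by direct enumeration. The short Python sketch below counts the triples that survive the removal rules for 76 variables and lags 0 to 5; the function name and the `cross_only` flag are hypothetical and serve only this illustration.

```python
def count_valid_triples(n_vars, max_lag, cross_only=False):
    """Count the (x, y, lag) triples left after the removal rules."""
    count = 0
    for lag in range(max_lag + 1):
        for x in range(n_vars):
            for y in range(n_vars):
                if cross_only and x == y:
                    continue  # auto-correlation, dropped at every lag
                if lag == 0 and x >= y:
                    # (x, x, 0) is trivially one and (y, x, 0) duplicates (x, y, 0)
                    continue
                count += 1
    return count


vf_per_patient = count_valid_triples(76, 5)
vf_all_patients = 82 * vf_per_patient
print(vf_per_patient, vf_all_patients)
```

The per-patient count is 76 × 75 / 2 = 2,850 pairs at lag zero plus 5 × 76² = 28,880 triples at lags one to five, which gives 31,730, and 82 times that is 2,601,860.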



The Random Bag. This is a heuristic approach whereby a random selection 
of triples is placed in a “bag” containing Rank Size triples. With each iteration 
a new random triple is added to the bag. When the bag overflows, the worst 
correlation falls out. This is repeated for a predefined number of iterations. The 
algorithm is described below: 

Set i = 0, R = Rank Size, Q = Empty Queue
While i < MAX Do
    t = new random triple: (x, y, lag)
    i = i + 1
    If t is valid and t is not a member of Q Then insert t into Q
    Sort Q in descending order of correlation
    If size of Q = R + 1 Then remove a triple from the tail of Q
End While

Note that a triple is valid if it does not warrant removal. MAX is the max- 
imum number of allowed calls to the correlation function. The final contents of 
the Bag represent the solution, i.e. the required Rank Size correlations. 
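
The Random Bag procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `corr` and `is_valid` are hypothetical callables supplied by the caller, variables are numbered from zero, and the worst member is found with `min` instead of re-sorting the bag on every iteration.

```python
import random


def random_bag(corr, n_vars, max_lag, rank_size, max_calls, is_valid, seed=0):
    """Keep the `rank_size` best triples seen among `max_calls` random draws."""
    rng = random.Random(seed)
    bag = {}  # maps a triple (x, y, lag) to its absolute correlation
    for _ in range(max_calls):
        t = (rng.randrange(n_vars), rng.randrange(n_vars), rng.randint(0, max_lag))
        if not is_valid(t) or t in bag:
            continue  # the draw is still counted, as in the pseudocode
        bag[t] = abs(corr(*t))
        if len(bag) > rank_size:
            worst = min(bag, key=bag.get)  # the bag overflows: drop the worst
            del bag[worst]
    # Return the bag ranked in descending order of correlation.
    return sorted(bag.items(), key=lambda item: item[1], reverse=True)


def toy_corr(x, y, lag):
    """Stand-in for a real correlation function (hypothetical)."""
    return 1.0 / (1 + abs(x - y) + lag)


top = random_bag(toy_corr, n_vars=4, max_lag=1, rank_size=3,
                 max_calls=2000, is_valid=lambda t: t[0] != t[1])
```

Because a triple that has fallen out of the bag is forgotten, it can be drawn and evaluated again later; a cache of evaluated triples would avoid this at the cost of extra memory.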

Genetic Algorithm. A Genetic Algorithm is a method for search based on the 
mechanics of natural selection and genetics. A population of chromosomes that 
represents possible solutions is used to explore the search space. This is achieved 
by updating the population with the creation of new chromosomes, formed 
through the recombination of two existing chromosomes (using the Crossover 
operator), a small perturbation analogous to mutation (by the Mutation Opera- 
tor), and the destruction of less fit chromosomes (through the Survival operator) 

m-- 



Generate Population chromosomes
For i = 1 to Generations do
    Apply Crossover operator to Population
    Apply Mutate operator to Population
    Apply Survival operator to Population
End For

For the correlation problem, a chromosome is represented by a number of
genes corresponding to the required correlation rank size. Each gene consists
of a correlation triple (x, y, lag) and each triple has three elements. Therefore
each chromosome has a string of integer numbers, $element_i$, the length being
equal to 3 × RankSize. One-point linear crossover is used; the crossover point is
restricted to multiples of three, thereby making sure that no gene will be split.

Mutation is defined as follows:

    $element_i = U(1, Max(element_i))$, if $element_i$ is to mutate

where $element_i$ is the ith element ($1 \le i \le 3 \times n$) of the chromosome,
U(a, b) returns a uniformly distributed random positive integer number between
a and b inclusive, and $Max(element_i)$ returns the maximum value that variable
$element_i$ could take.
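
The two GA operators described above can be sketched as follows. This is an illustrative Python fragment only: the flat-list chromosome layout, the `max_values` tuple holding the largest admissible x, y and lag, and the per-element mutation probability `rate` are assumptions introduced for the example.

```python
import random


def crossover(parent_a, parent_b, rng):
    """One-point crossover; the cut always falls on a gene (triple) boundary."""
    n_genes = len(parent_a) // 3
    cut = 3 * rng.randint(1, n_genes - 1)  # a multiple of three, never 0 or the end
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


def mutate(chromosome, max_values, rate, rng):
    """Replace each element, with probability `rate`, by U(1, Max(element_i))."""
    return [rng.randint(1, max_values[i % 3]) if rng.random() < rate else value
            for i, value in enumerate(chromosome)]


rng = random.Random(1)
parent_a = list(range(1, 13))  # four genes: (1, 2, 3), (4, 5, 6), ...
parent_b = [0] * 12
child_a, child_b = crossover(parent_a, parent_b, rng)
mutated = mutate(parent_a, (5, 5, 2), 1.0, rng)
```

Restricting the cut to multiples of three keeps every (x, y, lag) triple intact, so crossover only regroups whole correlations between the two parents.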

The parameters were selected through experimentation for optimal perfor- 
mance. These are listed in Table 1 where RW uses the Roulette Wheel technique 
and DS uses deterministic selection (this is where the individuals carried over to 
the next generation are the best, thus not allowing for duplicates as with RW). 



Table 1. GA Parameters

         Mutation%   Crossover%   Population   Generations   Rank Size   Survival
  VF     0.1         100          10           2500          100         RW
  FCC    0.4         100          10           28            250         DS



Evolutionary Programming. Evolutionary Programming is based on a simi- 
lar paradigm to Genetic Algorithms. However, the emphasis is on mutation and 
the method does not use any recombination. The basic algorithm is outlined as 
follows (EEI): 

Generate Population chromosomes
For i = 1 to Generations do
    Duplicate the Population to Children
    Apply Mutate operator to Children
    Add the Children back to the Population
    Apply Survival operator to Population
End For

Traditionally, EP algorithms use Tournament Selection fl] during the sur- 
vival of the fittest stage and the best chromosome out of the final population 
will be the solution to the problem. However, it was decided that the entire pop- 
ulation would be the solution for our EP method as in the RB method. That is, 
each individual chromosome would represent a single correlation (a triple) while 
the population would represent the set of correlations found (Population Size = 
Rank Size). This therefore required a check for any duplicates after mutation, 
and for any invalid chromosomes. Any children that fell into this category were
repeatedly mutated until they became valid. Although the entire population
would represent the solution, it must be noted that the fitness of each individ- 
ual would still be independent of the rest of the population. Each individual 
would try to maximise the correlation coefficient that it represents. This in turn 
would maximise the population’s fitness by improving the set of correlations 
represented by the population. The number of generations required for the two 
applications is in the range of 10 and 25. 

As can be seen with the GA method, a gene is a correlation triple (x, y, lag).
Within the EP, however, a gene is either x, y, or the lag. We have used the idea
of Self-Adapting Parameters [5] in this context. Here each gene, $gene_i$, in
each chromosome is given a parameter, $\sigma_i$. Mutation is defined as follows:

    $gene_i' = gene_i + N(0, \sigma_i)$                 (4)

    $\sigma_i' = \sigma_i \times \exp(s + s_i)$         (5)

    $s = N(0, 1/\sqrt{2n})$                             (6)

    $s_i = N(0, 1/\sqrt{2\sqrt{n}})$                    (7)

Note that s is constant for each gene in each chromosome but different between
chromosomes, and $s_i$ is different for all genes. Both parameters are generated
each time mutation occurs. Initial examination of the performance of the
RB method found that the performance was better than the GA (see Section
4). Similarities were drawn between this basic method and the EP algorithm.
The major difference was that, rather than adding a new random chromosome to the
population, an existing member of the population is copied and mutated in a
controlled manner. Each chromosome consisted of three parameters and their
corresponding $\sigma$ values. The value of n is the size of each chromosome, i.e. three.
Each gene within a chromosome is mutated according to the Normal distribution
with mean 0 and standard deviation equal to the gene’s corresponding $\sigma$ value
(equation 4). The $\sigma$ values, themselves, are mutated according to equation 5.
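
The self-adaptive mutation can be sketched in Python as below. Several details are assumptions made for the example and are not stated in the text: genes are rounded to integers and clamped to hypothetical `bounds`, each gene is perturbed with its current deviation before that deviation is updated, and the spreads of s and of the per-gene term follow the commonly used self-adaptation settings 1/sqrt(2n) and 1/sqrt(2 sqrt(n)).

```python
import math
import random


def ep_mutate(genes, sigmas, bounds, rng):
    """Mutate one chromosome (x, y, lag) together with its deviations."""
    n = len(genes)
    # s is drawn once per chromosome and shared by all of its genes.
    s = rng.gauss(0.0, 1.0 / math.sqrt(2.0 * n))
    new_genes, new_sigmas = [], []
    for gene, sigma, (low, high) in zip(genes, sigmas, bounds):
        # The second term is drawn afresh for every gene.
        s_i = rng.gauss(0.0, 1.0 / math.sqrt(2.0 * math.sqrt(n)))
        moved = int(round(gene + rng.gauss(0.0, sigma)))   # equation 4
        new_genes.append(min(max(moved, low), high))       # clamping: an assumption
        new_sigmas.append(sigma * math.exp(s + s_i))       # equation 5
    return new_genes, new_sigmas


rng = random.Random(0)
child, child_sigmas = ep_mutate([10, 20, 3], [1.0, 1.0, 1.0],
                                [(0, 75), (0, 75), (0, 5)], rng)
```

Because the deviations are multiplied by a log-normal factor they stay positive, and a chromosome whose deviations suit the local structure of the search space tends to produce fitter children, which is how the step sizes adapt without being set by hand.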

4 Results 

For each dataset the different methods were run until the number of calls to 
the correlation function is equivalent to that of the exhaustive search. With 
the GA and the EP methods, this meant setting an artificially high number of 
generations. In practice this would be pointless. Figures 1-4 show the results 
from these experiments. Within these figures the logarithm has been taken of 
the number of function calls (to the appropriate correlation function) so as to 
highlight the difference in performance of the different algorithms. Next, an
analysis was made of each method after approximately five percent of the total
number of correlations had been calculated.

4.1 FCC Results

The FCC results took a significantly longer time to run than with the VF data, 
since the time-series for this dataset was considerably larger. A rank size of 250 
was decided upon for this dataset. The maximum average correlation for the top 
250 correlations is shown in Figures 1 and 2 (“Top 250”). It can be seen that the 
EP method converges nearer to this maximum during the first 3,000 function 
calls (~ 3.5 in Figures 1 and 2) before slowing. The RB method does the next 
best, converging at a slower rate than EP but faster than the GA which is by 
far the slowest. The result for the GA was the best produced from a number of 
experiments using differing parameter values. 

It was found that a very small mutation rate and a very high crossover rate
were optimal in that they minimised function calls. It must be noted that although
the function calls were reduced, the algorithm was still slowed drastically by the
crossover operator. All methods eventually ‘meet’ at about 30,000 function calls
(~4.5 in Figures 1 and 2) just below the maximum. The results are almost
identical for Pearson’s and Spearman’s correlation coefficients.





Fig. 1. Pearson’s Correlation Coefficient for the FCC Data 



Table 2 shows the analysis of the best correlations found. The exhaustive 
method shows the overall best, worst and average correlation in the entire possi- 
ble 55,335. Top 250 shows these for only the top 250 correlations. The remaining 
three columns show the results after approximately five percent of the correlations
were calculated (~3,000 function calls) for the RB method, the EP method
and the GA method. From the table, it can be observed that the EP method
has the best average for both Pearson’s and Spearman’s correlation coefficient. 

Fig. 2. Spearman’s Correlation Coefficient for the FCC Data

Table 2. Five Percent Freeze for the FCC Data

                     Exhaustive        Top 250           RB                EP                GA
                     PCC      SRC      PCC      SRC      PCC      SRC      PCC      SRC      PCC      SRC
  Num. of calls      55335    55335    55335    55335    ~3000    ~3000    3203     2821     3240     3283
  % of calls         100      100      100      100      5.42     5.42     5.79     5.10     5.86     5.93
  Best correlation   1        1        1        1        -        1        0.992    1        0.997    1
  Ave. correlation   0.259    0.303    0.975    0.986    0.767    0.765    0.888    0.838    0.355    0.414
  Worst correlation  0        0        0.961    0.958    0.624    0.639    0.775    0.709    0        0.015

4.2 VF Results

A rank size of 100 was chosen as it is large enough to show if the relationships
between the 76 points correspond to the same nerve fibre bundles [2]. The average
correlation for the top 100 correlations is shown in Figures 3 and 4 (“Top 100”).
It can be seen, as in the FCC dataset, that the EP method converges nearer to
this maximum during the first 130,000 function calls (~5.1 in Figures 3 and 4)
before slowing. However, it can be seen, especially in the Pearson results, that the
EP method remains consistently higher than the other methods, more so than
in the FCC data. The RB method does the next best, converging at a slower
rate than EP. However, the GA method catches up with the RB method as the
number of calls increases. Again, the result for the GA was the best produced
from a number of experiments using differing parameter values. In the Spearman
experiments, the EP method is eventually outperformed by the GA and the RB
method. However, this is after approximately 50 percent of the number of calls
the exhaustive method made.

Table 3 shows the analysis of the best correlations found. The exhaustive 
method shows the overall best, worst and average correlation in the entire pos- 
sible 2,601,860. Top 100 shows these for only the top 100 correlations. The re- 
maining three columns show the results after approximately five percent of the 
correlations were calculated (~ 130,000 function calls) for the RB method, the 
EP method and the GA method. Once again the EP method has the best average 
for both Pearson’s and Spearman’s correlation coefficient.

Fig. 3. Pearson’s Correlation Coefficient for the VF Data

Fig. 4. Spearman’s Correlation Coefficient for the VF Data

Table 3. Five Percent Freeze for the VF Data

                     Exhaustive        Top 100           RB                EP                GA
                     PCC      SRC      PCC      SRC      PCC      SRC      PCC      SRC      PCC      SRC
  Num. of calls      ~2.6m    ~2.6m    ~2.6m    ~2.6m    130052   130052   126362   130215   123082   120540
  % of calls         100      100      100      100      4.99     4.99     4.86     4.99     4.73     4.63
  Best correlation   0.654    0.616    0.654    0.616    0.608    0.560    0.612    0.555    0.599    0.558
  Ave. correlation   0.343    0.338    0.578    0.526    0.493    0.463    0.543    0.498    0.420    0.408
  Worst correlation  0.186    0.202    0.549    0.506    0.453    0.431    0.508    0.470    0.292    0.306

4.3 Discussion

To summarise the results, the EP method behaved better than the RB method
and the GA in both datasets, using both correlation coefficients. It seemed that
there was a larger difference in performance between EP and the others in the
VF data. This is probably due to the fact that the numerical encoding of the
variables maps mathematically to spatial points upon the eye’s retina. A mutation
would result in a form of rotation or translation of co-ordinates. The
self-adapting standard deviations would adjust to a level where certain
transformations would result in many useful and high correlations, perhaps within the
same nerve fibre bundle. In contrast, the variables in the FCC data are ordered
in no meaningful way and so the self-adapting parameters can only adjust to
the peaks within the cross correlation function. The GA performed the worst,
particularly when the number of generations was set to a small number, because
the crossover operator merely mixes correlation genes around. The crossover
operator is additionally designed to carry forward good schemata, which cannot exist
within this context (i.e. all genes are independent and hence exhibit low epistasis).
The GA could have been designed as in the EP method, with an individual
representing a single correlation and the population representing the solution.
However, unless a binary representation was used, the chromosome size would
have been too small to fully exploit crossover. The binary representation would
have been approximately 16 bits in the case of the FCC data. This is still very
small, and some elementary experimentation has verified this.

5 Concluding Remarks 

Within this paper we have explored several methods for quickly learning the 
correlation structure of a large dataset. We have applied these methods to two 
real world datasets that exhibit properties where this type of analysis is useful. 
That is, they are large multivariate time-series where the fast identification of 
the approximate correlation structure is needed. The results show that the EP 
method is by far the quickest to converge to a high average correlation. The 
self-adapting parameters appear ideally suited to finding meaningful clusters 
of correlations; however it still remains to be investigated where this method 
is appropriate, and where it falls down. We suggest that the EP method will 
perform no better than that of the RB method if there are no patterns within 
the underlying correlations of the dataset being explored. Here we are referring 
to the characteristics of the cross correlation function for the FCC data, and the 
spatial arrangement of the points within nerve fibre bundles in the VF data. 

Extensive work has been carried out to investigate the usefulness of this data 
pre-processing method. For example we have found that the EP method is an 
efficient way of speeding up an algorithm to find a Dynamic Bayesian Network 
structure for the FCC data.


Abstract. Post pruning of decision trees has been a successful approach
in many real-world experiments, but over all possible concepts it does 
not bring any inherent improvement to an algorithm’s performance. This 
work explores how a PAC-proven decision tree learning algorithm fares 
in comparison with two variants of the normal top-down induction of 
decision trees. The algorithm does not prune its hypothesis per se, but 
it can be understood to do pre-pruning of the evolving tree. We study 
a backtracking search algorithm, called Rank, for learning rank-minimal 
decision trees. Our experiments follow closely those performed by Schaffer [20].
They confirm the main findings of Schaffer: in learning concepts
with a simple description pruning works, whereas for concepts with a complex
description, and when all concepts are equally likely, pruning is injurious,
rather than beneficial, to the average performance of the greedy top-down
induction of decision trees. Pre-pruning, as a gentler technique,
settles in the average performance in the middle ground between not 
pruning at all and post pruning. 



1 Introduction 

Pruning trades accuracy of the data model on the training data with syntactic 
simplicity in the hope that this will result in a better expected predictive ac- 
curacy on the unseen instances of the same domain. Pruning is just one bias — 
preference strategy — among a large body of possible ones, and it only works 
when it is appropriate from the outset. Schaffer [20] demonstrated that it is a
misconception that pruning a decision tree, as compared to leaving it intact,
would yield better expected generalization accuracy except for a limited set of
target concepts. Suitably varying the setting of the learning situation gives rise
to the better performance of the strategy which does not prune the tree at all. In
particular, according to Schaffer's [20] experiments, pruning a decision tree when
classification noise prevails degrades the performance of decision tree learning.

We explore the practical value and generality of another kind of decision tree 
learning bias; one that is not geared towards avoiding overfitting nor minimizing 
the complexity of the resulting decision tree. In this approach one characteristic 
figure of the data model is, however, optimized. Inspired by the theoretical results 
of Ehrenfeucht and Haussler |S|, we test how optimizing a secondary size and 
shape related parameter, the rank of a decision tree, affects the utility of the
resulting bias. By changing the values of the input parameters of the learning
algorithm called Rank (Jj we are tuning the fitting of the evolving decision tree 
to the training data. This is, in a sense, tampering with the level of pruning,
but it happens in a controlled way: the fitting is taken into account already in 
growing the decision tree, when choosing the attributes and deciding the tree’s 
shape. The combined data fit endorsement and model complexity penalization 
yields, in Domingos ’ m terms, representations- oriented evaluation. 

The rank-optimizing algorithm works by backtracking, i.e., the algorithm 
may choose to revoke decisions it has made previously. The advantage of opti- 
mizing a size-related parameter is that the learning algorithm cannot suffer from 
pathology that affects decision tree learning algorithms using lookahead m nor 
is it as sensitive to the size of the training data as standard greedy top-down 
induction of decision trees m- Practical experience has shown the rank of a 
decision tree to be a very stable measure 0. 

Elomaa and Kivinen ^ and Sakakibara m proved that rank-bounded de- 
cision trees can be learned — within the PAC framework of Valiant — when 

classification noise prevails. The erroneous examples dictate that the fitting of 
the decision tree cannot be perfect. Relaxed fitting is obtained by pruning, not 
following full tree construction, but rather in pre-pruning fashion. Pre-pruning 
is a familiar technique from the early empirical decision tree learning algorithms 
like IDS [1 4f I . Recently it has, again, been advocated as an alternative to post 
pruning of decision trees [m^ . Pre-pruning also implements early stopping of 
training 1231 , which aims to prevent overtraining and, through it, overfitting. 

Our experiments confirm, in a slightly different setting, what was already
reported by Schaffer [20]: over the space of equiprobable random Boolean concepts
the Naive strategy to decision tree learning, which does not even attempt to 
prune the decision tree built, outperforms the Sophisticated one, which may 
choose to prune the tree, independent of the classification noise level affecting 
the domain. The latter strategy, though, has constantly obtained better results 
in empirical experiments. One opposite trend to those reported by Schaffer comes 
up in our experiments; viz. as the noise rate approaches maximal .5, the better 
performances start to even out between the learning strategies, rather than to 
steadily pile up in favor of the Naive strategy. This result is what one would 
intuitively expect. 

The algorithm Rank has been previously observed to be competitive with 
other decision tree learning approaches on UCI data sets [Zl. Our experiments 
here further confirm that in learning simple concepts without large amounts of 
noise prevailing, rank-minimization attains as good results as the straightforward 
top-down induction of decision trees. However, on concepts that are harder to
express by decision trees, Rank cannot match the performance of the Naive
strategy. As the noise rate increases, the rank-minimization algorithm catches
up the advantage the Naive strategy possessed initially. In comparison with the
Sophisticated strategy, Rank steadily outperforms it.

In Section 2 we introduce briefly the learning of rank-bounded decision trees.
In Section 3 a set of empirical experiments, inspired by those performed by
Schaffer [20], is reported and discussed. Section 4 considers the outcome and
implications of the experiments. Section 5 reviews further related research.
Finally, Section 6 presents the conclusions of this study.

Fig. 1. Tree structures of rank 1 in 0X3(2).

2 Learning Decision Trees of Minimum Rank 

Ehrenfeucht and Haussler pj showed that the function class represented by rank- 
bounded binary decision trees is learnable in the sense of the basic PAC model 
m- Its superclass — functions determined by arbitrary binary decision trees — is 
not learnable jO]. 

There is only one natural way to extend the original definition of the rank 
to general multivalued decision trees without losing the learnability property. 

Definition 1. The rank of a decision tree T, denoted by r(T), is defined as:

1. If T consists of a single leaf then r(T) = 0.

2. Else if $T_{max}$ is a subtree of T with the maximum rank $r_{max}$, then

   $r(T) = \begin{cases} r_{max} & \text{if } T_{max} \text{ is unique,} \\ r_{max} + 1 & \text{otherwise.} \end{cases}$

This definition generalizes strictly that of Ehrenfeucht and Haussler 0: The 
rank of a binary tree is the same as the rank of the subtree with a higher rank if
the two subtrees have different ranks. Otherwise, the rank of the full tree is the
rank of both its subtrees incremented by one.
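
Definition 1 translates directly into a short recursive function. In the Python sketch below a decision tree is represented as a nested list, with any non-list value standing for a leaf; this representation and the function name are assumptions chosen for the illustration.

```python
def rank(tree):
    """Rank of a decision tree given as nested lists (a non-list is a leaf)."""
    if not isinstance(tree, list):
        return 0  # a single leaf has rank 0
    ranks = [rank(subtree) for subtree in tree]
    r_max = max(ranks)
    # A unique maximum-rank subtree passes its rank up unchanged;
    # a tie between two or more subtrees raises the rank by one.
    return r_max if ranks.count(r_max) == 1 else r_max + 1


chain = [["a", "b"], "c"]                         # ranks of subtrees: 1 and 0
balanced = [["a", "b"], ["c", "d"]]               # ranks of subtrees: 1 and 1
ternary = [["a", "b", "c"], "d", ["e", "f", "g"]]
print(rank("a"), rank(chain), rank(balanced), rank(ternary))
```

A chain-like tree, in which every internal node has at most one non-leaf child, therefore keeps rank 1 however deep it grows, while a complete binary tree of depth d has rank d.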

Example 1. Let DT_m^r(n) denote the set of all m-ary decision trees of rank at
most r over n variables. We are concerned with reduced decision trees, where
each variable appears at most once on any path from the root to a leaf. Observe
that in a reduced tree n is the maximum rank of the tree.

Fig. 2. Examples of tree structures in DT|(3).

Delimiting the rank of a decision tree (together with the value m) determines
the tree structures in DT_m^r(n). DT_m^0(n) only contains the single-leaf tree
structure, independent of the value of m. The number of possible labelings of
that only leaf gives the number of functionally different equal-structured
decision trees; DT_m^0(n) always contains n separate one-leaf decision trees. For values
r > 0 there exists more than just one possible tree structure (assuming n > 1).
For instance, Fig. 1 illustrates all (reduced) tree structures of rank 1 contained
in DTi(2). 

The decision trees are obtained from these structures by assigning each node 
with a label; the labeling must be legal, i.e., it has to keep the tree reduced. The 
function represented by a decision tree is not necessarily unique. 

As the parameter values m, n, and r grow, the number of legal tree structures
in DT_m^r(n) and their possible labelings go up quickly. The tree structures in Fig.
2 are examples of those belonging to DT§(3).

In order to be able to construct decision trees of minimum rank in noisy 
domains one has to relax the fitting of examples by letting the decision tree 
give inconsistent decisions. Typically in top-down induction of decision trees as 
tightly fitted decision trees as possible are constructed and then, in the second 
phase, they are pruned back in order to avoid overfitting them to the false trends 
that happen to prevail in the training set I2IE]. Obviously, such pruning cannot 
be used in connection of rank-minimization. 

In the algorithm Rank growing of a branch of the evolving tree is stopped 
when relaxed fitting is reached. Two a priori determined parameters control the 
degree of the required fitting. The relaxed fitting amounts to pre-pruning. The 
construction of a branch of the evolving decision tree is stopped when only a 
small portion of the examples under consideration deviates from the majority 
class or a minimal number of the examples belongs to a different class than 
the majority one. The values for the input parameters determining the
relative portion of consistent examples required (parameter γ) and the absolute
number of errors allowed per class (parameter κ) before stopping the growing
of a branch can be calculated exactly from the values of the standard PAC
parameters sample size, accuracy ε, confidence δ, and noise rate η [TE^. The
three latter parameters, of course, are unknown in practice.

We take γ and κ simply to be a real-valued and an integer-valued, respectively,
input parameter for the learning algorithm. It is typical in inductive learning
methods to expect the user to supply a value for a confidence level or a
threshold parameter, which corresponds to the γ parameter of Rank. For instance,
C4.5 [16] is an example of such a program. Furthermore, in C4.5 the user is
allowed to tune the value of a parameter (-m) that corresponds to κ.

Table 1. The programs implementing the relaxed rank-minimization.
StoppingCondition is the relaxed fitting rule that implements pre-pruning. Find is the
subprogram that checks, by backtracking search, if a decision tree of the given
rank exists. Rank is the main program controlling the value of the rank bound.

boolean StoppingCondition( Sample S, int κ, double γ )
{
    // M_i is the number of instances of class i in S. k is the majority class in S.
    if ( M_k > γ|S| or M_j < κ for all j ≠ k (and M_k > κ) ) return true;
    return false;
}

DecisionTree Find( Sample S, int r, Variables V, int κ, double γ )
{
    if ( StoppingCondition( S, κ, γ ) ) return T = k;   // k is the majority class in S.
    if ( r = 0 ) return none;
    for ( each informative variable v ∈ V ) {
        // S_k^v is the part of S in which variable v has value k.
        for ( each k ∈ [m] ) T_k^v ← Find( S_k^v, r − 1, V \ {v}, κ, γ );
        if ( ∀k ∈ [m] : T_k^v ≠ none )
            { T ← MakeTree( v, T_1^v, ..., T_m^v ); return T; }
        if ( T_k^v = none for a single value k = i ∈ [m] ) {
            // Exactly one branch failed: retry it with the full rank bound r.
            T_i^v ← Find( S_i^v, r, V \ {v}, κ, γ );
            if ( T_i^v ≠ none ) T ← MakeTree( v, T_1^v, ..., T_m^v );
            else T ← none;
            return T;
        }
    }
    return none;
}

DecisionTree Rank( Sample S, Variables V, int R, int κ, double γ )
{
    r ← R; T ← Find( S, R, V, κ, γ );
    if ( T = none ) {
        // No tree of rank R exists: search upwards.
        repeat { r ← r + 1; T ← Find( S, r, V, κ, γ ); }
        until T ≠ none or r = |V|;
        if ( T = none ) T ← (T = k);   // k is the majority class in S.
        return T;
    }
    // A tree of rank at most R exists: search downwards for a smaller rank.
    r ← r(T);
    while ( r > 0 ) {
        Q ← Find( S, r − 1, V, κ, γ );
        if ( Q ≠ none ) { r ← r(Q); T ← Q; }
        else return T;   // no tree of lower rank exists, so T is rank-minimal
    }
    return T;
}

The programs implementing the learning algorithm Rank are described as
code in Table 1. The main program Rank inputs five parameters: the training
sample S, the variable set V, the initial rank candidate R, and the values for
the parameters κ and γ. Depending on whether a decision tree is found by the
subprogram Find using the initial rank candidate or not, the search for the true
rank of the sample proceeds in one of two directions. If, at the end, no decision
tree is found (due to too tight fitting requirements), a single-leaf tree
predicting the majority class of the sample is returned.

In the program Find it is assumed that all the variables in V are nominal
and have the same arity; i.e., that their domain consists of the value set
[m] = {1, ..., m}. This assumption is included here for the clarity of the code;
it is not a restrictive assumption. Find carries out the backtracking search for a
decision tree of at most the given rank bound. Observe that this search is not
exhaustive, which would require exponential time, but avoids unnecessary
recursion by carefully keeping track of the examined rank candidates. The
asymptotic time requirement of this algorithm is linear in the size of the sample and
exponential in the rank of the sample. However, the rank of the sample is
constant, and usually very low in practice; hence, the algorithm is feasible. By $S_k^v$
we denote the set of those examples in the sample S in which variable v has
value k.

Finally, StoppingCondition checks whether the relaxed stopping condition 
holds in the sample under consideration. In other words, it returns value true if 
the conditions of pre-pruning are fulfilled and branch growing can be terminated. 
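
The relaxed stopping rule can be sketched in Python as follows. The function takes only the class labels of the examples that reach a node; the names `kappa` and `gamma` stand for the integer and real-valued input parameters, the comparisons are strict as printed in the listing of Table 1, and a non-empty sample is assumed. All of these are choices made for this illustration.

```python
from collections import Counter


def stopping_condition(labels, kappa, gamma):
    """Return True when growing of the current branch may stop (pre-pruning)."""
    counts = Counter(labels)                        # assumes labels is non-empty
    majority, m_k = counts.most_common(1)[0]
    others = [c for cls, c in counts.items() if cls != majority]
    # Either the majority class covers a large enough share of the sample ...
    mostly_one_class = m_k > gamma * len(labels)
    # ... or every other class has fewer than kappa examples.
    few_deviations = all(c < kappa for c in others) and m_k > kappa
    return mostly_one_class or few_deviations


almost_pure = ["a"] * 9 + ["b"]
evenly_split = ["a"] * 5 + ["b"] * 5
small_noise = ["a"] * 6 + ["b"] * 2
print(stopping_condition(almost_pure, 1, 0.8),
      stopping_condition(evenly_split, 1, 0.8),
      stopping_condition(small_noise, 3, 0.8))
```

Raising `kappa` or lowering `gamma` lets branches stop earlier, which corresponds to heavier pre-pruning.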

3 Empirical Evaluation 

We carry out a set of experiments inspired by those of Schaffer [20]. In them two
strategies to decision tree learning implemented by the CART learning algorithm 
P] were compared: Naive strategy chose the decision tree grown without a chance 
of ever pruning it and the Sophisticated strategy had the possibility of pruning 
the tree produced. The purpose of duplicating a part of these experiments is 
to assess the overall performance of the Rank algorithm and its pre-pruning 
strategy in contrast with two versions of the more familiar greedy top-down 
induction procedure. 

3.1 Experiment Setting 

Each experiment consists of 25 trials. We take all the trials into account, but 
pay special attention to the discrepant trials, those in which the trees of the
strategies differ in their accuracy. All experiments concern learning of Boolean
functions on five attributes: named a through e. The number of training examples 
given to the learning algorithms is 50 randomly allotted instances. The prediction 
accuracy of a tree is analytically determined. 

In testing the statistical significance of our findings we follow Schaffer and
use the non-parametric one-sided binomial sign test. Given n discrepant trials in
an experiment, the test lets us reject the hypothesis that the strategies perform
equally well with confidence

$$1 - \sum_{i=0}^{r} \binom{n}{i} \Big/ 2^{n},$$

where r, $0 \le r \le \lfloor n/2 \rfloor$, is the number of “wins” recorded by
the strategy obtaining fewer wins. All trials, except when otherwise stated, have
a .1 classification noise affecting them; i.e., each training example has a one
tenth chance of having a corrupted (complemented) class value.
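
As a minimal sketch of the sign-test confidence described above, the following Python function evaluates one minus the lower binomial tail for n discrepant trials and r wins of the weaker strategy. The function name is ours, not the paper's; the example values are one count pair reported later in Table 2.

```python
from math import comb


def sign_test_confidence(n: int, r: int) -> float:
    """Confidence 1 - sum_{i=0}^{r} C(n, i) / 2^n of the one-sided sign test.

    n is the number of discrepant trials and r the number of wins of the
    strategy that won fewer of them (hypothetical helper, for illustration).
    """
    if not 0 <= r <= n // 2:
        raise ValueError("r must satisfy 0 <= r <= floor(n / 2)")
    tail = sum(comb(n, i) for i in range(r + 1)) / 2 ** n
    return 1.0 - tail


# 15 wins against 6 in 21 discrepant trials
print(round(sign_test_confidence(21, 6), 2))  # -> 0.96
```

With 21 discrepant trials split 15 to 6, the sketch gives a confidence of about .96, which agrees with the corresponding entry of Table 2.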

Instead of CART we use the C4.5 (release 8) algorithm to implement the
basic learning strategies. As Schaffer argues, nothing should change through
changing the top-down induction algorithm; the relative strengths and
weaknesses of the strategies should stay constant. The Sophisticated strategy is
the default pruning of C4.5, and in the Naive strategy pruning has been turned off.

In C4.5, which uses global post pruning, the performance of the final decision
tree is less sensitive to the values of the input parameters than that of Rank,
which is built on local pre-pruning. For Rank we adjust the values of its
parameters k and γ empirically, but then keep the values constant throughout
each experiment (25 trials). Only when large amounts of noise affect the learning
situation will k have a value different from 1. We do not leave the order of
attribute inspection arbitrary, as described in the code in Table 1. Instead,
we use the gini-index evaluation function to order the attributes. Nevertheless,
the rank restriction ultimately decides whether an attribute is chosen for the
tree or not.



3.2 Preliminary Experiments 

Schaffer started with two simple learning tasks, where the Sophisticated 
strategy emerged superior. In learning the first simple concept Class = a, there 
were 18 of 25 discrepant trials and in each of them the Sophisticated strategy 
was superior obtaining, indeed, the maximum achievable average accuracy of .9. 

The learning algorithm Rank also attains the maximum achievable accuracy 
in each of the 25 trials using parameter value γ = .75. In this case the correct
concept is a decision tree containing (at least) two leaves, which have to ac- 
commodate also all the corrupted examples. Therefore, relatively loose fitting is 
required for learning the correct concept. 

In learning the second concept Class = a V (b A c), there were 17 discrepant 
trials in Schaffer’s experiment. The Sophisticated strategy chose a superior tree 
in 16 out of them, attaining the average accuracy of .854. 

This time a correct decision tree contains at least four leaves. Thus, the fitting 
of the tree does not need to be as relaxed as in the previous experiment. Using 
parameter value γ = .85, Rank comes up with the correct concept in 18 of 25
trials attaining the average accuracy of .876. 

In sum, these two simple concepts are at least as well learned by the back- 
tracking rank-minimization as by growing a full decision tree and subsequently 
pruning it. This result supports our earlier finding — on the UCI data — that Rank 
performs comparably with C4.5 on the most commonly used machine learning
test domains.

3.3 Parity and Random Functions

The parity function is true precisely when an odd number of the attributes take
on the true value. It is hard to express by any representation using only
single-attribute values. Parity serves to demonstrate the opposite case in
Schaffer’s experiments: for it the fully-grown trees are more accurate than the
pruned decision trees. Moreover, adding more noise to the training examples
allows the Naive strategy to increase its lead over the Sophisticated strategy.

In the basic .1 classification noise case, when the C4.5 algorithm was used to 
implement the two strategies, the Naive strategy outperformed the Sophisticated 
one in each trial. The average accuracy of the Naive strategy over the 25 trials 
was .675 and that of the Sophisticated strategy was .542. 

Parity is a worst-case function for rank-minimization since the correct
decision tree for it has as high a rank as possible. Therefore, any other matching
concept of lower rank for the noisy training set is preferred over the true one. In
light of this, one cannot expect Rank to attain the accuracy level of the Naive
strategy. Using parameter value γ = .9, Rank obtained the average accuracy of
.552. The number of discrepant cases with the Sophisticated strategy was 19 of
25, out of which in 12 trials Rank produced the superior decision tree. The Naive
strategy was superior to Rank in all trials.

A better understanding of the strategies’ relative performance is obtained by 
repeating Schaffer’s experiment of learning a random Boolean function on five 
attributes. In each trial we randomly fix a new target function. Parameter values 
for Rank need to be adjusted differently for each target function. 
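
The experimental setting can be illustrated with a short Python sketch: a target function on five Boolean attributes (parity or a random truth table), a noisy training sample of 50 instances, and an exact accuracy computed over all 32 instances. All names here are ours. Two assumptions are not spelled out in the text: instances are drawn uniformly, and accuracy is measured against class labels corrupted at the same noise rate, which is consistent with the maximum achievable accuracy of .9 under .1 noise quoted in Section 3.2.

```python
import itertools
import random

ATTRS = 5
# All 2^5 = 32 instances over the attributes a through e.
INSTANCES = list(itertools.product([False, True], repeat=ATTRS))


def parity(x):
    """True precisely when an odd number of attributes are true."""
    return sum(x) % 2 == 1


def random_target(rng):
    """Draw a random Boolean function as a truth table over all instances."""
    return {x: rng.random() < 0.5 for x in INSTANCES}


def noisy_sample(target, size, noise, rng):
    """Sample instances uniformly (assumption) and complement each class
    label with probability `noise`."""
    sample = []
    for _ in range(size):
        x = rng.choice(INSTANCES)
        label = target[x]
        if rng.random() < noise:
            label = not label
        sample.append((x, label))
    return sample


def analytic_accuracy(hypothesis, target, noise):
    """Expected accuracy of `hypothesis` on labels of `target` corrupted at
    rate `noise`, computed exactly over all instances (uniform weighting)."""
    agree = sum(hypothesis(x) == target[x] for x in INSTANCES) / len(INSTANCES)
    return agree * (1 - noise) + (1 - agree) * noise


rng = random.Random(0)
target = random_target(rng)
train = noisy_sample(target, size=50, noise=0.1, rng=rng)
```

Under these assumptions a hypothesis identical to the target scores exactly 1 minus the noise rate, and a constant hypothesis scores .5 on parity.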

The number of discrepant trials between Rank and the Naive strategy is 
23, out of which in 22 Naive produces the superior tree. Rank outperforms the 
Sophisticated strategy in 14 of 16 discrepant trials. The average accuracies over 
the 25 trials are .777 for the Naive strategy, .728 for the Rank algorithm, and 
.701 for the Sophisticated strategy. 

These experiments go to show that when the correct decision tree has a 
syntactically complex tree description, close fitting of the tree to the training 
examples gives the best results. Learning random Boolean concepts demonstrates 
that the majority of concepts falls into this category. 

3.4 The Effect of Classification Noise 

The experiments reported by Schaffer that bear most generality concern
learning random Boolean functions. These experiments correspond to normal
randomization analysis and give a measure of the average performance when all
concepts are equally likely. However, as often pointed out, this does not
necessarily conform to the real-world performance of the strategies.

Schaffer observed that as the classification noise rate of the training examples
increases from .05 to .3, the number of discrepant trials steadily grows. The
Sophisticated strategy is not able to conquer a larger proportion of those cases.
Instead, the Naive strategy keeps advancing its superiority. This is a strong
demonstration of pruning lacking any inherent advantage in fighting against the
effects of noise.

Table 2. The effect of classification noise: random Boolean functions.

Error  Naive > Sophistic.   Naive > Rank         Rank > Sophistic.    Average accuracies
rate   wins       conf.     wins       conf.     wins       conf.     Naive   Rank    Sophistic.
.05    22 of 22   .99       19 of 20   .99       15 of 17   .99       .782    .738    .700
.1     23 of 23   .99       22 of 23   .99       14 of 16   .99       .777    .728    .701
.15    18 of 21   .99       15 of 20   .98       17 of 22   .99       .709    .691    .656
.2     21 of 22   .99       17 of 21   .99       18 of 21   .99       .695    .669    .622
.25    19 of 21   .99       12 of 19   .82       11 of 15   .94       .682    .663    .639
.3     15 of 21   .96       13 of 23   .66       12 of 19   .82       .625    .621    .584

Table 2 lists the results of our experiment. The first column gives the noise
rate. The next two record in how many of the discrepant cases the Naive
strategy outperformed the Sophisticated one, and the confidence level with
which we can trust the former to be superior in this experiment. Corresponding
column pairs are given for the remaining two learning strategy pairs. The last
three columns report the average accuracies of the algorithms over the 25 trials.

Our results parallel those reported by Schaffer: the Naive strategy is
superior to the Sophisticated one independent of the error rate affecting the class
attribute. However, there is also an opposite trend here: when only a small
error probability affects the learning situation, the Naive strategy is, with a
very high confidence level, significantly better than the Sophisticated strategy.
As the error rate approaches .5, this confidence level starts to decline.

This effect is clearer in the comparison with the Rank algorithm. When .3 
classification noise prevails, the confidence which we have for the Naive strategy’s 
superiority is only .66 and the confidence for the superiority of Rank over the 
Sophisticated strategy has dropped down to .82. The decreasing confidences are 
due to the fact that the concept represented by the training examples approaches 
a random one. Thus, no data model can predict the class of an instance. The 
same can also be observed from the dropping average accuracies of all three 
strategies; they all gradually approach the random guess’ accuracy .5. 

In Schaffer’s experiment the number of discrepant cases also increased along 
with the error rate. No such trend appears in the results of our experiment. In 
our experiment the noise rate as such has no bearing on the relative strengths
of the learning strategies; none of them can be said to cope with classification
noise better than the others do.



3.5 Further Experiments 

Schaffer further explored the effects of, e.g., changing the representation of
the concepts, adding different kinds of noise to the training examples, and
varying the number of instances belonging to different classes. We, however,
leave similar tests as future work. It is evident that the representation changes
would have similar effects on our test strategies as they had on those of Schaffer.

4 Discussion

The above experiments confirmed what Schaffer already reported: in learning
syntactically simple concepts, pruning leads to more accurate decision trees
than leaving the tree as it is after growing it. There are concepts that require
a syntactically more complex description. In learning such concepts it is better 
to do as accurate fitting of the decision tree to the training examples as possi- 
ble and not prune the tree. Most importantly, the concepts requiring a complex 
decision tree description are the ones that dominate the space of all possible 
concepts. 

The last point is crucial for the practical applicability of decision trees.
However, the empirical evidence overwhelmingly supports the hypothesis that most
real-world learning domains have a syntactically simple decision tree
representation. There also exists some analysis to back up this claim.

An interesting difference in the results of the above comparison and those of 
Schaffer is that as the noise rate approaches the maximal .5 level, both the Naive 
strategy and the Rank algorithm start to lose the edge that they have over the 
Sophisticated strategy. This effect is intuitive; as the noise rate increases, the 
class associated with the instances tends towards a random assignment. Thus, 
it is impossible for any learning approach to predict the class of an instance. 
As Schaffer (pp. 163-165) analyses, exact fitting is the best prediction policy
when all concepts are equiprobable. Therefore, the Naive strategy maintains 
some edge over the Sophisticated one even with as high error rate as .3. 

Pre-pruning is not as aggressive as post pruning, which explains a substantial 
part of the differences observed in the performance of the Rank algorithm and 
the other two learning strategies. From these experiments it is hard to discern
to what extent we can attribute these differences to the incomparable search
procedures. However, in some trials the backtracking search clearly benefited
and in others hampered the performance of decision tree learning.



5 Related Research 

In a subsequent study Schaffer generalized the main observations of the
overfitting avoidance bias into the conservation law of generalization performance,
which essentially observes that there cannot be a universally superior bias. More
or less the same has been expressed in Wolpert’s “no free lunch” theorems.
The relevance of these results for practical inductive learning can be questioned,
since experience has shown that many of the learning domains encountered in
practice have an extremely simple representation and, hence, the bias of
heavy pruning would suit practical learning tasks better than other biases.

Holder has shown that the intermediate decision trees (subtrees pruned
in a breadth-first order from the full trees grown by C4.5) perform better than
the full and pruned trees in a large corpus of UCI data sets. Intermediate
decision trees correspond to pre-prunings of the full tree. Holder’s experiments
also support our earlier observation: pre-pruning, if given a slight advantage by
careful parameter adjusting, may be a competitive alternative to post pruning
of decision trees on real-world domains.

Domingos has considered the correctness of different interpretations of
Occam’s Razor. He compiles a substantial amount of analytical and empirical 
evidence against interpreting that the Law of Parsimony would somewhat qual- 
ify or justify pruning as an inherently beneficial technique in concept learning. 
No evidence supports assuming that lower syntactic complexity of a concept 
description would somehow transform into lower generalization error. 

6 Conclusions 

In this study we wanted to put into the perspective of Schaffer’s results the bias
induced by pre-pruning of decision trees and to compare the bias of
rank-minimization to that of the standard top-down induction of decision trees.
The performance of pre-pruning on the scale of the complexity of concept
representation settles between the Naive and the Sophisticated strategies. The bias
resulting from rank-minimization was visible in some trials, but on the average
level we cannot conclude much about it on the basis of these experiments.

The main findings of our experiments are similar to those reported by
Schaffer: the Sophisticated strategy is the better choice in domains
with a simple description, while the Naive one is the better choice otherwise.
The differences that were observed in learning random Boolean concepts when 
classification noise prevails are interesting and worth attention. 

In future work one could experiment with forcing a maximum rank bound
for the decision trees, in the spirit of learning decision trees with stringent
syntactic restrictions, and observe the effects of such a restriction on the
prediction accuracy of the resulting tree. Further experimentation with the
effects of different noise generating schemes and other tests performed by
Schaffer should be carried out in order to complete the study initiated in this
paper.

Abstract. Although feature selection is a central problem in inductive
learning as suggested by the growing amount of research in this area, 
most of the work has been carried out under the supervised learning 
paradigm, paying little attention to unsupervised learning tasks and, 
particularly, clustering tasks. In this paper, we analyze the particular 
benefits that feature selection may provide in hierarchical clustering. We 
propose a view of feature selection as a tree pruning process similar to 
those used in decision tree learning. Under this framework, we perform 
several experiments using different pruning strategies and considering 
a multiple prediction task. Results suggest that hierarchical clusterings 
can be greatly simplified without diminishing accuracy. 



1 Introduction 

The widespread use of information technologies produces a growing amount
of data that is too large to be analyzed by manual methods. There are large
volumes of data containing both many features and many examples. Inductive
learning methods are a powerful means of automatically extracting useful
information from this data or of assisting humans in this process. A problem
related to this sort of data is the presence of a large number of features that
might tend to decrease the effectiveness of learning algorithms, especially if most
of these features appear to be irrelevant with regard to the learning task. In fact,
feature selection is a central problem in inductive learning, as suggested by the
growing amount of research in this area.

However, most of the work concerning feature selection has been carried out 
under the supervised learning paradigm, paying little attention to unsupervised 
learning tasks and, particularly, clustering tasks. Clustering is a form of un- 
supervised learning used to discover interesting patterns in data. Particularly, 
hierarchical clustering methods construct a tree-structured clustering where sib- 
ling clusters partition the observations covered by their parent. The particular 
knowledge organization and performance tasks of hierarchical clusterings sug- 
gest several dimensions to analyze the benefits of feature selection in hierarchical 
clustering tasks. In this paper, we propose a novel view of feature selection as
pruning in hierarchical clustering and propose some possible implementations.
In addition, we perform an empirical comparison of these methods, analyzing the
results under the proposed dimensions.

D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 75 ff., 1999.

2 Feature Selection in Hierarchical Clustering 

Typically, the primary goal of feature selection is to make inductive
learning algorithms more robust in the face of irrelevant features. Clearly, this 
may be a motivation for applying feature selection to hierarchical clustering 
tasks. However, it is important to highlight two specific factors that surround 
these tasks that may be relevant for feature selection. The first factor is the form 
of the knowledge base. Commonly, hierarchical clusterings are polythetic classi- 
fiers, that is, they divide objects based on their values along multiple features. 
Particularly, they tend to use the full set of features at each node to decide 
how to classify a new object. Note that, while in monothetic classifiers such as 
decision trees, a redundant feature adds one additional test when classifying 
a new observation, in polythetic classifiers it adds a test for each node in the 
classification path. 

Secondly, we should take into account the performance task in which
hierarchical clusterings should be useful. As in most inductive data analysis
approaches, we can differentiate between prediction and description tasks. As
unsupervised approaches, hierarchical clustering systems are not restricted to
predicting a single class label. As remarked by several authors, unsupervised
learning systems can support a flexible prediction task aimed at supporting
prediction over all the features. Because of this multiple inference task, the
presence of irrelevant features may be even more harmful in unsupervised systems.
On the other hand, unsupervised learning may focus on description tasks rather
than on prediction. In such a case, we can view feature selection as a means of
simplifying concept descriptions that may provide a more readable interpretation
of the domain, without necessarily taking into account accuracy concerns.

In order to clarify further discussion, we now attempt to summarize four 
different dimensions to evaluate the particular benefits of feature selection in 
the hierarchical clustering task: 

- Irrelevant features. The set of features used in an inductive learning task is
  a powerful representational bias that determines the performance of a
  learning system. Irrelevant features may be particularly harmful in unsupervised
  systems since they try to form patterns around sets of correlated features
  without the guidance of external labeling.

- Efficiency in the learning task. As we have noted, hierarchical clusterings are
  polythetic classifiers. Since the decision of how to classify a new instance has
  to be made along several nodes in the tree, the number of features present
  in the data directly influences the complexity of the clustering process. If we
  apply feature selection to reduce this complexity, we should expect to obtain
  clusterings of similar or better quality than we would have obtained by using
  all the available features.

- Efficiency in the performance task. When using a hierarchical clustering to
  classify an unseen observation in order to infer unknown properties, the
  number of features still has a strong influence on the complexity of the process
  in the same manner we have described above. Again, selecting an appropriate
  subset of features may reduce this complexity. Since, in this case, the concern
  lies in exploiting feature selection to speed up a prediction task, learning does
  not necessarily have to be affected by the feature selection process.

- Comprehensibility of the results. With the exception of logic-based approaches,
  which select features by using the dropping conditions rule, clustering
  systems usually make use of all the available features at each node of the
  hierarchy. Reducing the number of features used in the clustering process makes
  it possible to provide shorter cluster descriptions to the user. Short
  descriptions tend to be more readable and, hence, more comprehensible.
  Comprehensibility has been recognized as an especially important concern in
  clustering.



3 Feature Selection as Pruning Concept Trees 

Tree-based models are typical in inductive learning approaches. A powerful
method for supervised classification tasks is the induction of decision trees.
In early approaches, the tree was expanded until all objects of a child were
members of the same class. However, it has been found that this strategy may
overfit the data by growing the tree more than is justified by the training set. To
solve this problem, pruning strategies have been developed that either stop tree
expansion during learning, or remove certain nodes in a post-processing step.
Although the term feature selection is usually used in supervised learning
to denote a process external to the induction algorithm, pruning decision trees
may be viewed as such a process, because we reduce the number of terms used in
concept descriptions. In addition to avoiding overfitting, pruning produces
simpler trees and, thus, more comprehensible descriptions.

As remarked by Langley, pruning methods can be combined with any
learning technique that deals with complex structures. Particularly, hierarchical
clusterings may grow arbitrarily, producing lower levels that are not justified by
the training data, thus justifying the use of pruning methods. Because hierarchical
clusterings can be used to predict many features, a pruning of the tree that is
appropriate for one feature may be inappropriate for other features. To cope
with this problem, we can identify frontiers of clusters for prediction of each
feature. With this strategy, in order to predict a feature value, classification
does not necessarily have to terminate at a leaf, but may stop at an intermediate
node.

The mentioned approaches to pruning, in both decision trees and hierarchical
clusterings, constrain the depth of the tree. Therefore, they may be viewed as
vertical pruning strategies aimed at reducing the height of the resulting tree. In
decision trees, this constraint results in a reduction of the number of terms used
in the final descriptions. A hierarchical clustering, however, is a polythetic
classifier: removing a node reduces the length of the classification path, but the
full set of features is still used in concept descriptions. If, on the other hand,
we see a node in a hierarchical clustering as a set of tests to be made, one for
each feature, to classify an instance through the children of this node, we can
think about some sort of horizontal pruning strategy aimed at reducing the width
of the levels of the tree by removing some features. Figure 1 graphically shows
these ideas. Note that the complexity of classifying an object in a hierarchical
clustering depends on both the length of the path and the number of features
included in the computation of the metric used to decide the best host. Therefore,
both vertical and horizontal pruning strategies should improve the prediction
task by reducing the path and the number of features used in computing the metric
used for clustering, respectively. In addition, horizontal pruning can provide more
readable concept descriptions, thus improving performance along two of the four
evaluation dimensions presented in Section 2.

Fig. 1. Vertical (a) vs. horizontal (b) pruning. Dashed lines indicate pruned
nodes or features.

4 Horizontal Pruning Methods 

Our proposal of viewing feature selection as pruning constrains feature selection
to be performed over an existing knowledge structure. This view contrasts with
the traditional perspective of feature selection as a preprocessing step performed
before learning occurs. There are two different points in the learning process at
which this sort of feature selection can be performed: during learning and after
learning. Note that this distinction is analogous to the traditional distinction
between prospective (or prepruning) and retrospective (or postpruning) strategies.

Methods for deciding which features should be selected can be designed in
a similar fashion to existing pruning methods for decision trees. Any pruning
method must estimate the ‘true’ error rate of the pruned trees. For example,
some pruning methods rely on some form of significance test to
determine whether the unpruned tree is significantly better than the pruned
version. Others are based on resampling strategies such as holdout or
cross-validation. These examples suggest a division of horizontal pruning strategies
into blind and feedback methods. Blind methods consider features independently
of the induction algorithm used, relying on some sort of estimation based on the
data. Feedback methods use the results of the induction algorithm to obtain an
estimate of performance. A similar distinction of feature selection methods is
that into filter and wrapper methods.

Additionally, the pruning-based view suggests another interesting issue for
feature selection. Existing methods of feature selection select a unique subset of
features which is then used in the learning process. In our case, this means that
all the nodes in all the levels are pruned in a homogeneous fashion. However,
it appears reasonable to think about a more selective pruning strategy that
considers different sets of features for each node. Different subsets of features
may perform better at different local parts of the observation space than a single
global set. Therefore, we can classify horizontal pruning strategies along a second
dimension according to the scope of the pruning, distinguishing local and
global methods. This view of feature relevance is similar to the
local weighting approaches used in lazy learning.

The presented dimensions may suggest a variety of strategies for horizontally 
pruning concept trees when combined with usual methods for organizing the 
search into the space of features. For example, we can start with the empty set 
of features and successively add features or we can start with all the features 
and successively remove them. These widely known strategies are called forward 
selection (FS) and backward elimination (BE) and may be used together with 
some error estimation method in order to find a suitable subset of features. Under 
our framework, we can choose between implementing blind or feedback based 
versions of the FS and BE strategies. Moreover, we can apply the procedures 
once for the whole hierarchical clustering, obtaining a homogeneous pruning, or
apply them locally to each node. Therefore, we can talk about local and global
versions of FS and BE. 



5 Experiments 

In our experiments, we implemented several methods covering local, global, blind 
and feedback approaches. Particularly, we are interested in investigating the 
impact of horizontal pruning along two dimensions, namely, efficiency of the 
performance task and comprehensibility. The performance task considered is the 
flexible prediction task previously discussed, which is measured as the average 
accuracy over all of the features present in the data. This accuracy is computed 
for each instance by repeatedly masking out one feature and then using the 
cluster hierarchy to classify the instance and predict the masked value. Final 
accuracy is obtained by averaging the accuracy obtained for each individual 
feature. The predicted value always corresponds to the one found in the reached 
leaf. 
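
The flexible prediction measure described above can be sketched as follows. This is only an illustration under stated assumptions: instances are dictionaries of nominal values, and `predict` is a hypothetical callable standing in for classification through the cluster hierarchy; neither is taken from the paper.

```python
from typing import Callable, Dict, Hashable, List

Instance = Dict[str, Hashable]
# Hypothetical interface: given an instance with one feature masked (None)
# and the name of that feature, return the value predicted for it.
Predictor = Callable[[Instance, str], Hashable]


def flexible_prediction_accuracy(
    instances: List[Instance], features: List[str], predict: Predictor
) -> float:
    """Average, over all features, of the accuracy obtained when that
    feature is masked out and predicted from the remaining ones."""
    per_feature = []
    for feature in features:
        hits = 0
        for inst in instances:
            masked = dict(inst)
            masked[feature] = None  # hide the value to be predicted
            if predict(masked, feature) == inst[feature]:
                hits += 1
        per_feature.append(hits / len(instances))
    return sum(per_feature) / len(per_feature)


# Toy usage with a predictor that always answers 1.
data = [{"a": 1, "b": 0}, {"a": 1, "b": 1}]
print(flexible_prediction_accuracy(data, ["a", "b"], lambda inst, f: 1))  # -> 0.75
```

In the toy usage, feature a is always predicted correctly and feature b half of the time, so the average over the two features is 0.75.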

We measured prediction efficiency empirically by recording the number of
feature tests needed to classify an instance when making a prediction. For
instance, if there are n features, for a given instance, the Cobweb algorithm first
evaluates the utility of making an independent cluster and performs n tests to
compute the category utility metric. Other operators, such as evaluating the
utility of incorporating the instance into an existing cluster among k different
clusters, need k·n tests. The sum of all these tests, averaged over all the
predictions made, is used as an indicator of the complexity of making predictions.
Finally, comprehensibility is evaluated simply by looking at the average number
of features per node. We assume that shorter descriptions are more readable, and
thus, more comprehensible.

The hierarchical clustering method used is the Cobweb system, since
it is a well-known method and its basic strategy is the core of many
unsupervised learning systems. Experiments were performed over three standard data
sets from the UCI Repository: soybean small, soybean large and zoo database.
The first two data sets are described by 35 features and the third by 16; thus
they should give a good preliminary picture of the power of feature selection as
horizontal pruning.

5.1 An Overview of Cobweb 

Cobweb is a hierarchical clustering system that constructs a tree from a
sequence of observations. The system follows a strict incremental scheme, that is,
it learns from each observation in the sequence without reprocessing previously
encountered observations. An observation is assumed to be a vector of nominal
values $V_{ij}$ along different features $A_i$. Cobweb employs probabilistic concept
descriptions to represent the learned knowledge. In this sort of representation,
in a cluster $C_k$, each feature value has an associated conditional probability
$P(A_i = V_{ij} \mid C_k)$ reflecting the proportion of observations in $C_k$ with
the value $V_{ij}$ along the feature $A_i$.

The strategy followed by Cobweb is summarized in Table 1. Given an
observation and a current hierarchical clustering, the system categorizes the
observation by sorting it through the hierarchy from the root node down to the
leaves. At each level, the learning algorithm evaluates the quality of the new
clustering resulting from placing the observation in each of the existing clusters,
and the quality resulting from creating a new cluster covering the new observation.
In addition, the algorithm considers two more actions that can restructure the
hierarchy in order to improve its quality. Merging attempts to combine the two
sibling clusters which were identified as the two best hosts for the new
observation; splitting can replace the best host and promote its children to the
next higher level. The option that yields the highest quality score is selected and
the procedure is recursed, considering the best host as the root in the recursive
call. The recursion ends when a leaf containing only the new observation is created.

In order to choose among the four available operators, Cobweb uses a cluster
quality function called category utility, defined for a partition
$P = \{C_1, C_2, \ldots, C_n\}$ of $n$ clusters as

$$\frac{\sum_{k} P(C_k) \sum_{i} \sum_{j} \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right]}{n}$$

This function measures how much a partition $P$ promotes inference and rewards
clusters $C_k$ that increase the predictability of feature values within $C_k$.
Function Cobweb(observation, root)

1) Incorporate observation into the root cluster.
2) If root is a leaf then
     return expanded leaf with the observation,
   else choose the best of the following operators:
     a) Incorporate the observation into the best host
     b) Create a new disjunct based on the observation
     c) Merge the two best hosts
     d) Split the best host
3) If a), c) or d), recurse on the chosen host.

Table 1. The control strategy of Cobweb.



1. F = ∅
2. For each feature A_i not in F:
     Estimate performance by classifying objects in the validation set starting
     from the root and using F ∪ {A_i} for the considered node.
3. Let A_m be the feature yielding the highest improvement.
4. If the performance of F ∪ {A_m} is higher than the performance of F, add A_m
   to F and go to 2.

Table 2. The FS algorithm for pruning features locally from a node.



Generally, the system is evaluated in terms of its predictive accuracy along all 
the features. 
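The category utility score described above can be illustrated with a short Python sketch. It is not the Cobweb implementation used in the experiments: the function name `category_utility` and the representation of a partition as a list of non-empty clusters, each a list of tuples of nominal values, are assumptions made here for illustration only.

```python
from collections import Counter


def category_utility(clusters):
    """Category utility of a partition (illustrative sketch).

    clusters: list of non-empty clusters; each cluster is a list of
    observations, and each observation is a tuple of nominal values,
    one value per feature (hypothetical data layout).
    """
    all_obs = [obs for cluster in clusters for obs in cluster]
    total = len(all_obs)
    n_features = len(all_obs[0])

    def sum_sq_probs(observations):
        # Sum over features i and values j of P(A_i = V_ij)^2,
        # estimated from the given set of observations.
        count = len(observations)
        score = 0.0
        for i in range(n_features):
            value_counts = Counter(obs[i] for obs in observations)
            score += sum((c / count) ** 2 for c in value_counts.values())
        return score

    baseline = sum_sq_probs(all_obs)
    # Weight each cluster's gain in predictability by P(C_k).
    gain = sum(len(c) / total * (sum_sq_probs(c) - baseline) for c in clusters)
    return gain / len(clusters)


if __name__ == "__main__":
    pure = [[("a", "x"), ("a", "x")], [("b", "y"), ("b", "y")]]
    print(category_utility(pure))
```

A partition made of a single cluster scores zero under this definition, since the within-cluster probabilities then coincide with the overall ones.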

5.2 Experiment 1: Feedback Methods 

Our first set of experiments is intended to explore the performance of
feedback-based methods. Our implementation uses a simple holdout strategy that divides
the data set into three subsets: 40% for training, 40% for validation and 20%
for test. A hierarchy is constructed using the training set, and the validation
set is then used to estimate the performance of each of the different subsets
of features considered. The final results shown correspond to the accuracy on
the test set. We implemented four versions of this procedure by applying FS
and BE strategies locally or globally. Table 2 shows the pseudo-code for the FS
procedure applied locally to a node. Pruning starts by applying the procedure
to the root node and it is recursively applied to the descendants until reaching
a leaf node. The implementation of the BE strategy is analogous, but starts
with the full set of features and removes one feature at each step. The global
strategies are implemented by estimating accuracy in step 2 using the selected
subset of features in all the nodes, instead of only in a particular node. Recall
Table 3. Results for feedback-based methods: global and local FS and BE.

Dataset        Pruning  Accuracy       Tests            Features/Node
zoo            None     85.50 ± 2.72   162.76 ± 8.44    16.00 ± 0.00
               GFS      82.31 ± 4.55    75.76 ± 38.07    7.33 ± 3.78
               GBE      85.83 ± 2.53   117.13 ± 19.79   11.10 ± 1.76
               LFS      85.81 ± 2.68    25.52 ± 5.16     1.21 ± 0.12
               LBE      85.91 ± 3.14    42.51 ± 8.61     1.80 ± 0.18
soybean small  None     85.70 ± 2.35   277.48 ± 24.97   35.00 ± 0.00
               GFS      85.82 ± 2.35    55.68 ± 26.55    7.07 ± 3.12
               GBE      85.11 ± 3.00   134.25 ± 57.82   16.46 ± 6.52
               LFS      84.58 ± 2.78    18.37 ± 2.79     1.31 ± 0.23
               LBE      85.41 ± 2.21    40.12 ± 17.21    3.40 ± 1.17
soybean large  None     83.53 ± 1.46   466.33 ± 22.15   35.00 ± 0.00
               GFS      81.60 ± 3.48   240.87 ± 93.70   17.95 ± 6.75
               GBE      83.24 ± 1.27   385.01 ± 37.62   28.85 ± 2.12
               LFS      N/A
               LBE      N/A

that global methods select the same subset for all the nodes, so there is no need 
to recurse. 
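The greedy loop of the FS procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: `forward_select` and the `evaluate` callback are hypothetical names, and `evaluate` stands in for whatever routine estimates performance on the validation set for a given feature subset (at one node for the local variant, over all nodes for the global one).

```python
def forward_select(features, evaluate):
    """Greedy forward selection in the spirit of the FS procedure.

    features: iterable of candidate feature names (strings).
    evaluate: hypothetical callback mapping a frozenset of features to a
              validation score, where higher is better.
    """
    selected = set()
    best_score = evaluate(frozenset(selected))
    remaining = set(features)
    while remaining:
        # Score every feature that has not been selected yet.
        scored = [(evaluate(frozenset(selected | {f})), f) for f in sorted(remaining)]
        score, feature = max(scored)
        # Stop as soon as the best candidate no longer improves performance.
        if score <= best_score:
            break
        selected.add(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


if __name__ == "__main__":
    useful = {"legs", "milk"}

    def toy_score(subset):
        # Toy stand-in for validation accuracy: reward useful features,
        # penalise irrelevant ones slightly.
        return len(subset & useful) - 0.1 * len(subset - useful)

    print(forward_select(["hair", "legs", "milk"], toy_score))
```

A BE variant would start from the full feature set and remove one feature per step under the same stopping rule.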

Table 3 shows the results from 30 trials with pruned clusterings generated
with the four methods mentioned above from random instance orderings. For
comparative purposes, we include the results obtained from unpruned trees
constructed over the training set and tested on the test set, but without including
the validation set. The lack of results of local pruning methods for the soybean
large data set is due to the high processing time required. This highlights a first
result: feedback-based local methods are likely to be impractical for data sets
with a high number of features. The reason is that we are trying to optimize a
multiple prediction task, so that we need to run multiple tests when estimating 
performance instead of a single test of predicting a class label. For this extended 
task, the complexity of the implemented feedback methods for a data set with 
n features is 0{n^), instead of the O(n^) required in supervised tasks. 

Paradoxically, local methods appear to be the better performers, dramatically
improving performance along both the prediction efficiency and comprehensibility
dimensions. In the first two data sets, results suggest that the average number
of tests required to classify an unseen object for prediction purposes may be
reduced by roughly 75-85% without diminishing accuracy. In addition,
fewer than two features per node in the zoo data set and fewer than four in the small
soybean data set appear to be enough for achieving accurate predictions and, presumably,
for describing the resulting hierarchies in a more readable manner. On the other
hand, FS methods are more likely to get stuck at local optima, thus producing
lower accuracies. By contrast, BE methods are more conservative, but at the
expense of obtaining larger feature subsets.
5.3 Experiment 2: Blind Methods

In our second set of experiments, we evaluate the performance of blind methods. 
Analogously to the first experiments, we can implement local and global versions 
of a blind method. However, since we have found evidence of the superiority of 
local over global strategies, we restrict our implementation of blind methods to 
the former. 

We implemented a method that could be cast as a Critical Value Pruning 
(CVP) method IT^ . This method is applied in decision tree induction by selecting 
a threshold or critical value for the attribute selection measure. A node is pruned 
if the value obtained by the measure does not exceed the critical value. In the 
case of horizontal pruning, we should use a measure estimating the predictive 
power of the features. A natural approach in COBWEB and related systems is
to use the evaluation metric used in the clustering process, since it is supposed
to promote partitions with high inference power. Therefore, for each feature, we
can obtain a numerical estimate of performance by measuring the contribution of
this feature to the category utility calculation, obtaining what Gennari |H| terms
the salience of a feature. The salience metric induces a partial order of relevance
on the set of features that can easily be exploited to select a subset of features
by introducing a threshold. Specifically, for a given set of features
$\{x_1, x_2, \ldots, x_n\}$ and a given $\tau$ value, we remove the features in the set

$$\{\, x \mid \mathrm{Salience}(x) < \tau \cdot \max\{\mathrm{Salience}(x_i) \mid i = 1, \ldots, n\} \,\}$$

The $\tau$ value is in the $[0, 1]$ range, so that $\tau = 0$ means that no feature
selection is performed, because salience is always positive, while increasing $\tau$
values will select smaller subsets of features. The pruning procedure starts
from the root node and recursively removes at each node the features
scoring under the selected threshold. Note that, by considering individual nodes,
we are considering different subsets of the object space and, therefore, different
probability distributions for the features. Because of that, a different subset of
features may be selected at each node, as we expect from using local strategies.
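The thresholding rule at a single node can be sketched in Python. The sketch assumes that the salience of each feature at the node has already been computed and is passed in as a dictionary; the function name `prune_by_salience` and this data layout are illustrative assumptions, not part of the original system.

```python
def prune_by_salience(salience, tau):
    """Return the features kept at one node for a given threshold.

    salience: dict mapping feature name -> salience at this node
              (assumed precomputed; hypothetical layout).
    tau:      threshold in [0, 1]; features whose salience is below
              tau * (maximum salience) are removed.
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    cutoff = tau * max(salience.values())
    return {feature for feature, s in salience.items() if s >= cutoff}


if __name__ == "__main__":
    node_salience = {"a": 0.8, "b": 0.4, "c": 0.1}
    for tau in (0.0, 0.1, 0.6, 1.0):
        print(tau, sorted(prune_by_salience(node_salience, tau)))
```

Applying the function to every node of the hierarchy, each with its own salience values, gives the local behaviour described above: different nodes may keep different subsets.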

Table 4 shows the results from 30 trials with pruned clusterings generated
with different $\tau$ values from random instance orderings. We only used the training
and test sets used in the first experiments, excluding the validation set in order
to obtain comparable results. As expected, the degree of pruning increases for
higher $\tau$ values. Again, we can observe that simpler clusterings perform as well
as clusterings using the full feature set. The case of the large soybean data set is
especially interesting, since we could not get a good picture of the performance
that horizontal pruning can provide in our first experiment. Clearly, efficiency
may be improved by at least 50% and the set of defining features per node
reduced to around eight. From the results, it appears that blind methods select
larger subsets of features than feedback methods in order to achieve comparable
accuracies. This may be because feedback methods are less sensitive to feature
dependences that cannot be detected by the salience metric. This suggests that,
although the results are very good, more drastic local pruning may be performed
in the soybean large data set by selecting the correct set of features.
Table 4. Results for the blind method using different $\tau$ values.

Dataset        τ      Accuracy       Tests            Features/Node
zoo            None   85.50 ± 2.72   162.76 ± 8.44    16.00 ± 0.00
               0.10   85.75 ± 2.91    79.89 ± 8.21     7.48 ± 1.17
               0.20   85.51 ± 3.54    69.09 ± 8.73     7.15 ± 1.23
               0.30   85.65 ± 3.66    60.45 ± 8.51     6.86 ± 1.23
               0.40   85.70 ± 3.85    54.85 ± 8.35     6.72 ± 1.23
               0.50   85.14 ± 3.57    49.00 ± 7.94     6.52 ± 1.22
               0.60   84.50 ± 3.17    42.47 ± 7.43     6.32 ± 1.23
               0.70   83.91 ± 3.43    38.38 ± 8.51     6.13 ± 1.26
               0.80   82.61 ± 4.50    36.16 ± 7.76     5.97 ± 1.27
               0.90   81.07 ± 4.82    32.95 ± 8.57     5.86 ± 1.29
soybean small  None   85.70 ± 2.35   277.48 ± 24.97   35.00 ± 0.00
               0.10   85.70 ± 2.30    98.87 ± 15.27    7.55 ± 0.93
               0.20   85.50 ± 2.45    88.27 ± 15.60    7.00 ± 0.80
               0.30   85.46 ± 2.55    80.02 ± 15.62    6.54 ± 0.87
               0.40   85.53 ± 2.69    69.24 ± 14.09    6.06 ± 0.91
               0.50   85.53 ± 3.13    59.83 ± 10.79    5.55 ± 0.77
               0.60   85.16 ± 2.87    49.86 ± 10.81    4.93 ± 0.75
               0.70   85.16 ± 2.30    35.62 ± 7.82     3.73 ± 0.60
               0.80   84.56 ± 2.67    27.03 ± 5.54     3.12 ± 0.50
               0.90   83.99 ± 2.83    21.84 ± 4.76     2.71 ± 0.59
soybean large  None   83.53 ± 1.46   466.33 ± 22.15   35.00 ± 0.00
               0.10   83.65 ± 1.53   212.36 ± 17.58    8.30 ± 0.59
               0.20   83.17 ± 1.50   171.89 ± 18.25    7.58 ± 0.58
               0.30   82.25 ± 1.76   139.12 ± 17.42    6.89 ± 0.58
               0.40   81.16 ± 1.86   111.89 ± 14.07    6.21 ± 0.53
               0.50   79.65 ± 1.98    87.26 ± 11.71    5.50 ± 0.50
               0.60   78.47 ± 2.59    68.70 ± 10.43    4.84 ± 0.49
               0.70   77.87 ± 1.74    51.63 ± 7.41     3.88 ± 0.41
               0.80   75.90 ± 2.93    38.93 ± 5.98     3.29 ± 0.39
               0.90   74.85 ± 3.00    31.87 ± 3.88     2.90 ± 0.36



6 Related Work 

As we have pointed out, there is little work on feature selection for clustering
tasks. Gennari 0 investigated a feature selection mechanism embedded in
CLASSIT, a descendant of COBWEB, and made some preliminary experiments.
However, his research differs from this work in that we focus on a complex flexible
prediction task and not on predicting a single class label. Two works that apply
feature selection as a preprocessing step are 0 and 0, but again, evaluation is
performed over class labels (but see [14] for some results in a multiple prediction
task using a preprocessing step).

In sum, our work is novel in two respects. First, the retrospective pruning
view makes feature selection a postprocessing step rather than a preprocessing
one, as is common in these tasks. Second, we focus on a multiple inference
task that has been proposed for unsupervised learning systems instead of class
label prediction. To our knowledge, no other work has examined the effect
of feature selection in such a task.



7 Concluding Remarks 

We have presented a view of feature selection as a retrospective tree pruning pro- 
cess similar to those used in decision tree learning. We think that this framework 
is a useful abstraction for the design and understanding of postprocessing
approaches to feature selection in hierarchical clustering. This becomes especially
important given the particular nature of the clustering task, which introduces
additional factors of complexity with respect to supervised tasks. We have briefly
outlined some of these factors such as the polythetic nature of hierarchical clus- 
terings or the formulation of the performance task as a multiple inference task 
along all the features in the data. Additionally, the framework provides a sin- 
gle view of the complex problem of simplifying hierarchical clusterings, unifying 
feature selection and node removal procedures. This novel approach should lead 
to useful combinations of both approaches. 

The empirical results provide evidence of the power of feature selection in
simplifying hierarchical clusterings. In particular, local methods appear to
dramatically reduce the number of features per node without diminishing accuracy.
Our implementation of blind methods performed quite well, suggesting that
they are a promising alternative to more computationally expensive feedback
methods.

Our work suggests that existing approaches to decision tree pruning may
usefully inspire further research into novel methods for unsupervised feature
selection. Future work should also evaluate the performance of integrated
horizontal and vertical pruning methods for hierarchical clustering.
Abstract. In many modern data analysis scenarios the first and most
urgent task consists of reducing the redundancy in high dimensional
input spaces. A method is presented that quantifies the discriminative
power of the input features in a fuzzy model. A possibilistic information
measure of the model is defined on the basis of the available fuzzy rules,
and the resulting possibilistic information gain, associated with the use
of a given input dimension, characterizes the input feature's discriminative
power. Due to the low computational expense derived from the
use of a fuzzy model, the proposed possibilistic information gain
yields a simple and efficient algorithm for the reduction of the input
dimensionality, even for high dimensional cases. As a real-world example,
the most informative electrocardiographic measures are detected for an
arrhythmia classification problem.



1 Introduction 

In recent years it has become more and more common to collect and store
large amounts of data from different sources. However, a massive recording
of a system's monitoring variables does not guarantee a better performance of further
analysis procedures if no new information is introduced in the input space.
In addition, the analysis procedure itself becomes more complicated for high
dimensional input spaces, and insights about the system's underlying structure
become more difficult to achieve.

An evaluation of the effectiveness of every input feature in describing the
underlying system can supply new information and simplify further analysis. The
detection of the most informative input features, that is, the features that best
characterize the underlying system, reduces the time and computational expense
of any further analysis and makes it easier to detect crucial parameters for
data analysis and/or system modeling.

A quite common approach for the evaluation of the effectiveness of the input
features defines some feature merit measures on the basis of a statistical model
of the system 0. Assuming that a large database is available, the probability
estimations involved in the definition of the feature merit measures are
performed by means of the event frequencies, which require a precise definition
of the input parameters and a clear identification of the output classes. In
many real world applications, however, estimated frequencies are unavoidably
altered by doubtful members of the output classes and by an inaccurate
description of the input parameters. In addition, the estimation of a probabilistic
model is computationally expensive for high dimensional input spaces.

The concept of fuzzy sets was introduced in |S| with the purpose of a more 
efficient, though less detailed, description of real world events, allowing an
appropriate amount of uncertainty. Fuzzy set theory also offers the advantage of
a number of simple and computationally inexpensive methods to model a given
training set. Based on the fuzzy set theory, some measures of fuzzy entropy have 
been established pB, ^ as measures of the degree of fuzziness of the model with 
respect to the training data. All the defined measures involve the data points 
into the fuzzy entropy calculation, in order to represent the uncertainty of the 
model in describing the training data. 

In this paper an analysis “a posteriori” of fuzzy systems is proposed, to 
evaluate the discriminative power of the input features in characterizing the 
underlying system. A measure of possibilistic information is defined only on 
the basis of fuzzy rules. The separability of the different membership functions 
is measured on every input dimension and the input dimension with highest 
separability defines the most discriminative input feature, at least according to 
the analyzed fuzzy model. All of this rests on the hypothesis that the fuzzy
model describes the data of the training set with sufficient accuracy, that is, that
a sufficiently general training set has been used for the fuzzy rules inference.
The main advantage of analyzing fuzzy rules, instead of fuzzy rules and training 
data as in 00, consists of the highly reduced computational costs for the same 
amount of information, provided that the fuzzy model faithfully describes the 
underlying data structure. 

The detection and ranking of the most effective input variables for a given
task could represent one of the first steps in any data analysis process. The
implementation of a fuzzy model generally requires little time even
in the case of very high dimensional input spaces, and so does the corresponding
evaluation of the discriminative power of the input features. Whenever a more
accurate representation of the system is desired, the analysis can continue with the
application of more sophisticated and more computationally expensive analysis
techniques on the most effective input features, pre-screened on the basis of the
proposed possibilistic information.

2 Possibilistic Feature Merit Measures 

2.1 A Possibility Measure 

Given a number $m$ of output classes $C_i$, $i = 1, \ldots, m$, and an $n$-dimensional
input space, numerous algorithms exist which derive a set of $N_R$ fuzzy rules $\{R_k\}$,
$k = 1, \ldots, N_R$, mapping the $n$-dimensional input into the $m$-dimensional
output space. This set of rules models the relationships between the input data
$x \in \mathbb{R}^n$ and the output classes $C_i$. Each input pattern $x = [x_1, \ldots, x_n]$ is
associated with each output class $C_i$ by means of a membership value $\mu_{C_i}(x)$. In
Figure 1a an example is reported with a two-dimensional input space $\{x_1, x_2\}$,
two output classes $C_1$ and $C_2$, and with trapezoids as membership functions
$\mu_{C_1}(x)$ and $\mu_{C_2}(x)$ describing the relationships between the input data and the
two output classes.

The membership function $\mu_{C_i}(x)$ quantifies the degree of membership of
input pattern $x$ to output class $C_i$. Its volume $V(C_i)$, as defined in eq. (1), therefore
represents a measure of the possibility of output class $C_i$, on the basis of the given
input space $D \subset \mathbb{R}^n$. Considering normalized membership functions $\mu_{C_i}(x)$, a
larger volume $V(C_i)$ indicates a class of the output space with a higher degree
of possibility. An output class represented by a membership function which
takes value +1 everywhere on the input space is always possible. A membership
function with volume $V(C_i) = 0$ indicates an impossible class.



The overall possibility of the whole output space $C = \{C_1, C_2, \ldots, C_m\}$ can
be defined through the available fuzzy mapping system $\{R_k\} = \{R_1, R_2, \ldots, R_{N_R}\}$
as the sum of all the class possibilities $V(C_i)$, $i = 1, \ldots, m$. The relative
contribution $v(C_i)$ of output class $C_i$ to the whole output space's possibility is given
in eq. (2).



In case the output class $C_i$ is described by $Q_i > 1$ fuzzy rules, the possibility
of class $C_i$ is given by the possibility of the union of these $q = 1, \ldots, Q_i$ fuzzy
subsets of class $C_i$, each with membership function $\mu^{q}_{C_i}(x)$. The possibility of the
union of membership functions can be expressed as the sum of their possibilities,
taking care to include the intersection possibility only once (eq. (3)). If trapezoids
are adopted as membership functions, the possibility of each fuzzy rule $V_q(C_i)$
becomes particularly simple to calculate |SI .




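$$V(C_i) = \int_{x \in D} \mu_{C_i}(x)\, dx \qquad (1)$$

$$v(C_i) = \frac{V(C_i)}{\sum_{j=1}^{m} V(C_j)} \qquad (2)$$

$$V(C_i) = V\!\left( \bigcup_{q=1}^{Q_i} \mu^{q}_{C_i} \right) = \int_{x \in D} \bigcup_{q=1}^{Q_i} \mu^{q}_{C_i}(x)\, dx \qquad (3)$$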



2.2 A Possibilistic Information Measure 

The variable $v(C_i)$ quantifies the possibility of class $C_i$ relative to the
possibility of the whole output space and according to the fuzzy rules used to model
the input-output relationships. $v(C_i)$, as defined in eq. (2), can then be adopted
as the basic unit to measure the possibilistic information associated with class
$C_i$. With respect to a probabilistic model, the use of the relative
possibility of class $C_i$, $v(C_i)$, takes into account the possible occurrence of multiple
classes for any input pattern $x$, and the calculation of the relative volume $v(C_i)$
is generally easier than the estimation of a probability function.

As in traditional information theory, the goal is to produce a possibilistic
information measure that is

1. at its maximum if all the output classes are equally possible, i.e. $v(C_i) = 1/m$
for $i = 1, \ldots, m$, $m$ being the number of output classes;

2. at its minimum if only one output class $C_i$ is possible, i.e. in case $v(C_j) = 0$
for $j \neq i$;

3. a symmetric function of its arguments, because the dominance of one class
over the others must produce the same amount of possibilistic information,
independently of which class is the favorite.

In order to produce a measure of the global possibilistic information $I(C)$
of the output space $C = \{C_1, \ldots, C_m\}$, the traditional functions employed in
information theory, such as the entropy function $I_H(C)$ (eq. (4)) and the Gini function
$I_G(C)$ (eq. (5)), can then be applied to the relative possibilities $v(C_i)$ of
the output classes.

$$I_H(C) = -\sum_{i=1}^{m} v(C_i) \log_2 v(C_i) \qquad (4)$$

$$I_G(C) = 1 - \sum_{i=1}^{m} v(C_i)^2 \qquad (5)$$

In both cases, entropy and Gini function, $I(C)$ represents the amount of
possibilistic information intrinsically available in the fuzzy model. In particular, not
all the input features are equally effective in extracting and representing
the information available in the training set through the fuzzy model. The goal
of this paper is to make explicit which dimension of the input space is the most
effective in recovering the intrinsic possibilistic information $I(C)$ of the fuzzy
model.

2.3 The Information Gain

Given a fuzzy description of the input space $\{R_k\}$ with intrinsic possibilistic
information $I(C)$, a feature merit measure must describe the information gain
derived from the use of any input feature $x_j$ in the model. Such information
gain is expressed as the relative difference between the intrinsic information
of the system before, $I(C)$, and after using that variable $x_j$ for the analysis,
$I(C \mid x_j)$ (eq. (6)). The input features $x_j$ producing the highest information gains
are the most effective in the adopted model to describe the input space, and
therefore the most informative for the proposed analysis.

$$g(C \mid x_j) = \frac{I(C) - I(C \mid x_j)}{I(C)} \qquad (6)$$

Let us suppose that the input variable $x_j$ is related to the output classes by
means of a number of membership functions $\mu^{q}_{C_i}(x_j)$, with $q = 1, \ldots, Q_i$
membership functions for every output class $C_i$, for $i = 1, \ldots, m$ output classes,
and $N_R = \sum_{i=1}^{m} Q_i$. The use of input variable $x_j$ for the final classification
consists of the definition of an appropriate set of thresholds along input dimension
$j$ that allow the best separation of the different output classes. A set of cuts is
then created on the $j$-th input dimension, to separate the $F \leq N_R$ contiguous
trapezoids related to different output classes.

If trapezoids are adopted as membership functions of the fuzzy model, the
optimal cut between two contiguous trapezoids is located at the side intersection,
if the trapezoids overlap on the sides; at the middle point of the overlapping flat
regions, if the trapezoids overlap in their flat regions; at the middle point between
the two trapezoids, if they do not overlap.
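The three cases of the cut rule can be sketched for one input dimension as follows. The representation of a trapezoid as a tuple `(a, b, c, d)`, with support `[a, d]` and flat region `[b, c]`, and the function name `optimal_cut` are assumptions made for this illustration; the sketch also assumes that the first trapezoid lies to the left of the second.

```python
def optimal_cut(left, right):
    """Cut position between two contiguous trapezoids on one dimension.

    left, right: tuples (a, b, c, d) with a <= b <= c <= d, where [a, d]
    is the support and [b, c] the flat region of height 1 (hypothetical
    representation). `left` is assumed to lie to the left of `right`.
    """
    _, _, c1, d1 = left
    a2, b2, _, _ = right
    if d1 <= a2:
        # No overlap: middle point of the gap between the trapezoids.
        return (d1 + a2) / 2.0
    if c1 >= b2:
        # Flat regions overlap: middle point of the overlapping flat part.
        return (b2 + c1) / 2.0
    # Sides overlap: intersection of the falling side of `left`
    # with the rising side of `right`.
    fall, rise = d1 - c1, b2 - a2
    return (rise * d1 + fall * a2) / (fall + rise)


if __name__ == "__main__":
    print(optimal_cut((0, 1, 2, 3), (5, 6, 7, 8)))  # disjoint supports
    print(optimal_cut((0, 1, 4, 5), (2, 3, 6, 7)))  # overlapping flat regions
    print(optimal_cut((0, 1, 2, 4), (2, 4, 5, 6)))  # overlapping sides
```

In the side-overlap case the returned point is where both membership functions take the same value, so neither class dominates on its own side of the cut.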

Between two consecutive cuts, a linguistic value $L_k$ ($k = 1, \ldots, F$) can be
defined for parameter $x_j$. Considering $x_j = L_k$ corresponds to isolating one stripe
$c_k$ on the input space. In stripe $c_k$ new membership functions $\mu(C_i \mid x_j = L_k)$ to
the output classes $C_i$ are derived as the intersections of the original membership
functions with the segment $x_j = L_k$. Each stripe $c_k$ is characterized by a
local possibilistic information $I(c_k) = I(C \mid x_j = L_k)$ (eq. (4) or (5)). The average
possibilistic information $I(C \mid x_j)$, derived by the use of variable $x_j$ in the fuzzy
model, corresponds to the averaged sum of the local possibilistic information of
stripes $c_k$ (eq. (7)).

$$I(C \mid x_j) = \frac{1}{F} \sum_{k=1}^{F} I(c_k) \qquad (7)$$

The less effective the input feature $x_j$ is in the original set of fuzzy rules, the
closer the remaining $I(C \mid x_j)$ is to the original possibilistic information $I(C)$ of
the model and the lower the corresponding information gain is, as described in
eq. (6). Every parameter $x_j$ produces an information gain $g(C \mid x_j)$ expressing its
effectiveness in performing the required classification on the basis of the given
fuzzy model. The proposed information gain can be adopted as a possibilistic
feature merit measure.



2.4 An Example 

In Figure 1 an example is shown for a two-dimensional input space, two output
classes, and with trapezoids as membership functions. The corresponding intrinsic
possibilistic information of the original model $I(C)$ is reported in Table 1.
The average information of the system, $I(C \mid x_1)$ and $I(C \mid x_2)$, respectively after
dimension $x_1$ and $x_2$ have been used for the classification, are reported in Table 2,
together with the corresponding information gains $g(C \mid x_1)$ and $g(C \mid x_2)$.

Fig. 1. New data spaces cutting on variable b) $x_2$ and c) $x_1$



A cut between the two membership functions on dimension $x_2$ (Fig. 1b)
produces a better separation than a cut on dimension $x_1$ (Fig. 1c). That is, the
analysis on dimension $x_2$ offers a higher gain in information than the analysis
on dimension $x_1$. This is indicated by $g(C \mid x_1) < g(C \mid x_2)$, whether
$I(\cdot)$ is taken as the entropy or the Gini function (Tab. 2). From the comparison of the
information gains, $g(C \mid x_1)$ and $g(C \mid x_2)$, the analysis on variable $x_2$ supplies
more of the information available in the fuzzy model than the analysis carried
out on variable $x_1$. The same conclusion could have been reached using
$I(C \mid x_1) > I(C \mid x_2)$, but describing the information through the gain function produces
clearer results than using the possibilistic information parameter $I(C \mid x_j)$ directly.
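The arithmetic of this example can be retraced with a short Python sketch of the entropy and Gini measures, the stripe average and the information gain. The function names and the list-based layout of the class volumes are assumptions made for illustration; the volumes are the ones reported for the two-dimensional example, and base-2 logarithms are assumed for the entropy.

```python
import math


def relative(volumes):
    # Relative possibilities v(C_i) = V(C_i) / sum_j V(C_j).
    total = sum(volumes)
    return [v / total for v in volumes]


def entropy(v):
    # Entropy-based possibilistic information; zero terms are skipped.
    return -sum(p * math.log2(p) for p in v if p > 0)


def gini(v):
    # Gini-based possibilistic information.
    return 1.0 - sum(p * p for p in v)


def information_gain(class_volumes, stripe_volumes, measure):
    """Relative information gain of one input dimension.

    class_volumes:  V(C_i) of the full model.
    stripe_volumes: one list of V(C_i | x_j = L_k) per stripe.
    measure:        entropy or gini.
    """
    before = measure(relative(class_volumes))
    after = sum(measure(relative(s)) for s in stripe_volumes) / len(stripe_volumes)
    return (before - after) / before


if __name__ == "__main__":
    full = [13.0, 12.6]
    stripes_x1 = [[0.53, 12.6], [13.0, 0.53]]
    stripes_x2 = [[13.0, 0.0], [0.0, 12.6]]
    for name, measure in (("entropy", entropy), ("gini", gini)):
        g1 = information_gain(full, stripes_x1, measure)
        g2 = information_gain(full, stripes_x2, measure)
        print(name, round(g1, 2), round(g2, 2))
```

Run on these volumes, the gains for $x_1$ should land close to the values reported in Table 2, and the gain for $x_2$ equals 1 for both measures, since each stripe on $x_2$ contains a single possible class.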



3 Real World Applications 

The results in the previous section show the efficiency of the proposed
possibilistic feature merit measures in detecting the input dimensions with maximum
information content. In this section some experiments on real world databases are
performed and the corresponding results reported, in order to observe whether
these possibilistic feature merit measures are actually capable of detecting the
database features which control the maximum information even on real-world
data.

Table 1. The fuzzy information measures for the two dimensional example

C_1              C_2              I_H(C)   I_G(C)
V(C_1) = 13.0    V(C_2) = 12.6    0.99     0.49
v(C_1) = 0.51    v(C_2) = 0.49

Table 2. $I(C \mid x_j)$ and $g(C \mid x_j)$

x_1 = S              x_1 = L              x_2 = Y              x_2 = O
V(C_1|x_1) = 0.53    V(C_1|x_1) = 13.0    V(C_1|x_2) = 13.0    V(C_1|x_2) = 0.00
V(C_2|x_1) = 12.6    V(C_2|x_1) = 0.53    V(C_2|x_2) = 0.00    V(C_2|x_2) = 12.6
v(C_1|x_1) = 0.04    v(C_1|x_1) = 0.96    v(C_1|x_2) = 1.0     v(C_1|x_2) = 0.00
v(C_2|x_1) = 0.96    v(C_2|x_1) = 0.04    v(C_2|x_2) = 0.00    v(C_2|x_2) = 1.0
I_H(C|x_1) = 0.24                         I_H(C|x_2) = 0.00
I_G(C|x_1) = 0.07                         I_G(C|x_2) = 0.00
g_H(C|x_1) = 0.76                         g_H(C|x_2) = 1.0
g_G(C|x_1) = 0.84                         g_G(C|x_2) = 1.0



3.1 The IRIS Database 

The first experiment is performed on the IRIS database. This is a relatively
small database, containing data for three classes of iris plants. The first class is
supposed to be linearly separable from the other two, while the last two classes are
not linearly separable from each other. The plants are characterized in terms of:
1) sepal length, 2) sepal width, 3) petal length and 4) petal width.

Both possibilistic information gains are very high for the third and the fourth
input parameter, and almost zero for the first two input features (Tab. 3). In |S|,
where a detailed description of the parameters adopted in the IRIS database is
produced, the sepal length and sepal width (parameters 1 and 2) are reported to
be more or less the same for all three output classes, i.e. uninformative. Thus
input parameters 1 and 2 should not contribute to the correct discrimination of
the output classes. By contrast, the petal features (parameters 3 and 4)
characterize the first class of iris (iris setosa) very well with respect to the other
two.

In this case, the proposed possibilistic feature merit measures produce a very
reliable description of the informative power of every input parameter. Hence
parameters 1 and 2 could be removed and the analysis performed solely on the
basis of parameters 3 and 4 without a relevant loss of information. The class
correlation, reported in |^, is also very high for parameters 3 and 4 and much
lower for the first two parameters. That confirms the results from the possibilistic
feature merit measures.

3.2 Arrhythmia Classification 

A very suitable area for fuzzy, or more generally imprecise, decision systems
consists of medical applications. Medical reasoning is quite often a qualitative
and approximate process, so that the definition of precise diagnostic classes
with crisp membership functions can sometimes lead to inappropriate
conclusions. One of the most investigated fields in medical reasoning is the automatic
analysis of the electrocardiogram (ECG), and within that the detection of
arrhythmic heart beats.

Table 3. Information gain g(C) of the iris features in the IRIS database

I(C)             g(C)     x_1    x_2    x_3    x_4
I_H(C) = 1.44    g_H(C)   0.10   0.06   0.82   0.81
I_G(C) = 0.61    g_G(C)   0.10   0.06   0.84   0.79

Fig. 2. The ECG waveshape.

Some cells (the sino-atrial node) in the upper chambers (the atria) of the cardiac
muscle (the myocardium) spontaneously and periodically change their electrical
polarization, which progressively extends to the whole myocardium. This
periodic and progressive electric depolarization of the myocardium is recorded
as small potential differences between two different locations of the human body
or with respect to a reference electrode. The result is an almost periodic signal,
the ECG, that describes the electrical activity of the myocardium in time.
Each time period consists of a basic waveshape, whose waves are marked with
the letters P, Q, R, S, T, and U (Fig. 2). The P wave describes the
depolarization process of the two upper myocardium chambers, the atria; the QRS
complex as a whole describes the depolarization of the two lower myocardium
chambers, the ventricles; and the T wave describes the repolarization process at
the end of each cycle. The U wave is often absent from the beat waveshape and its
origin is controversial. The heart contraction follows the myocardium
depolarization phase. Anomalies in the PQRST waveshape are often connected to
faulty conduction of the electrical impulse in the myocardium.
Table 4. Set of measures characterizing each beat waveshape.

RR      RR interval (ms)
RRa     average of the previous 10 RR intervals
QRSw    QRS width (ms)
VR      iso-electric level (μV)
pA      positive amplitude of the QRS (μV)
nA      negative amplitude of the QRS (μV)
pQRS    positive area of the QRS (μV * ms)
nQRS    negative area of the QRS (μV * ms)
pT      positive area of the T wave (μV * ms)
nT      negative area of the T wave (μV * ms)
ST      ST segment level (μV)
STsl    slope of the ST segment (μV/ms)
P       P exist (yes 0.5, no -0.5)
PR      PR interval (ms)



A large family of cardiac electrical malfunctions consists of arrhythmic heart
beats, deriving from an anomalous (ectopic) origin of the depolarization wavefront
in the myocardium. If the depolarization does not originate in the sino-atrial
node, the depolarizing wavefront follows a different path and therefore
a different waveshape appears in the ECG signal. Arrhythmias are believed
to occur randomly in time, and the most common types have an anomalous
origin in the atria (SupraVentricular Premature Beats, SVPB) or in the ventricles
(Ventricular Premature Beats, VPB). With the development of automatic
systems for the detection of QRS complexes and the extraction of quantitative
measurements, large sets of data can be generated from hours of ECG signal. A
larger number of measures, though, does not guarantee better performance of the
subsequent classifier if no significant new information is added. Pre-screening
the most significant measures for the analysis has the double advantage of
lowering the input dimension and of improving the classifier’s performance when
poor quality measures are discarded.

The MIT-BIH database flij represents a standard in the evaluation of meth- 
ods for the automatic classification of the ECG signal, because of the wide set 
of examples of arrhythmic events provided. The MIT-BIH ECG records are two- 
channel, 30 minutes long and sampled at 360 samples/s. Two records (200 and 
233) from the MIT-BIH database are analyzed in this study, because of their 
high number of arrhythmic beats. QRS complexes are detected and for each 
beat waveshape a set of 14 measures CD is extracted by using the first of the
two channels in the ECG record (Tab. 4). The first 2/3 of the beats of each
record are used as training set and the last 1/3 as test set. A two-class problem,
normal (N) vs. ventricular premature beats (VPB), is considered for record 200 and a
three-class problem (N, VPB, and SVPB) for record 233, in order to quantify
the discriminative power of the input features for both classification tasks.
Table 5. Information gain for different ECG beat measures (record 200). The
amounts of correctly classified N and VPB and of uncertain beats are expressed
in %.

      RR    RRa   QRSw  VR    pA    nA    pQRS  nQRS  pT    nT    ST    STsl  P     PR    | N    VPB  unc.
g_H   .40   .00   .78   .09   .07   .00   .08   .04   .00   .61   .57   .01   .00   .00   | 99   97   1
g_G   .42   .01   .80   .11   .09   .01   .10   .05   .00   .63   .59   .02   .00   .00   |
g_H   .47   -     .42   .14   .25   .25   .38   .21   -     .15   .36   .17   -           | 99   96   1
g_G   .53         .44   .18   .25   .27   .43   .26   -     .16   .38                     |
g_H   .74   -     .08   -     .44   .42   .49   .28   -     -     .03   -     -           | 100  97   1
g_G   .81   -     .09   -     .48   .43   .55   .31   -     -     .04   -     -           |
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.59 .78 
At first all 14 measures are used for classification. The corresponding information
gains g_H(C) and g_G(C) are listed in Table 5, together with the percentages
of correctly classified and uncertain beats on the test set, for record 200. Beats
are labeled as uncertain if they are not covered by any rule of the fuzzy model.
The percentage of uncertain beats (unc.) is defined with respect to the number
of beats in the whole test set. The parameter with highest information gain is
marked in bold. The ECG measures with smallest information gains are then
progressively removed from the classification process. A similar table can be
obtained for record 233.

Ventricular arrhythmias are mainly characterized by alterations in the QRS
complex and T wave rather than in the PR segment. VPBs usually present a
larger and higher QRS complex, and to a lesser extent an altered ST segment.
In Table 5 some ECG measures produce no information gain from the very
beginning, such as the presence of the P wave (P), the average RR interval of the
previous 10 beats (RRa), and the PR interval (PR), as was to be expected.
Only 4-6 ECG features are characterized by a high information gain, that is, are
relevant for the classification process. An almost constantly used feature is the
RR interval, which quantifies the prematurity of the beat and is usually a sign
of general arrhythmia. The QRS width, the positive and negative amplitude
of the QRS complex, and the corresponding areas, all parameters related to the
QRS complex shape, also play an important role, individually or together, in the
classification procedure. If many input parameters are used, T wave features
provide helpful information for classification, but they lose importance if no
redundant input information is supplied. The low informative character of the
past RR intervals, shown by the low information gain of the RRa parameter,
confirms the unpredictability of VPBs. Individually, the positive and negative
amplitude of the QRS complex present the highest information gain, confirmed
by the highest performance on the test set, followed by the RR interval, the QRS
width, and the QRS positive area.

All the estimated discriminative powers in Table 5 find positive confirmation
in clinical VPB diagnostics. The redundant or uninformative character of the
input features with lowest information gain is demonstrated by the fact that their
removal does not affect the final performance on the test set, as long as at least
two of the most significant ECG measures are kept. Indeed, the same performance
on the test set is observed both with the full input dimensionality and after
removing the least significant ECG measures.

Record 233 presents a new class of premature beats with supraventricular
origin (SVPB) and a more homogeneous class of VPBs. Supraventricular
arrhythmias can be differentiated from normal beats mainly by means of the RR
interval and the PR segment, whenever the P wave can be reliably detected.
Consequently the analysis of record 233, compared with the analysis of record
200, shows a high information gain also for the PR measure, besides the negative
amplitude and area of the QRS complex and the RR interval already used for
VPB classification. However, if considered individually, none of the ECG
measures produces a high information gain and good performance on the test set for
all classes of beats. The PR interval proves to be useless if used alone for SVPB
classification, but it gains a high discrimination power if any other significant
ECG measure is added. The negative amplitude of the QRS complex and the
RR interval alone prove to be still highly discriminative for N/VPB classification,
but unhelpful for SVPB recognition.

4 Conclusions 

A methodology to estimate the discriminative power of input features based
on an underlying fuzzy model is presented. Because of the approximate nature
of fuzzy models, many algorithms exist to construct such models quickly
from example data. Using properties of fuzzy logic, it is easy and computationally
inexpensive to determine the possibilistic information gain associated with
each input feature. The capability of the algorithm is illustrated by using an
artificial example and the well-known IRIS data. The real-world feasibility was then
demonstrated on a medical application.

Rosaria Silipo and Michael R. Berthold

The defined information gain provides a description of the class discriminability
inside the adopted fuzzy model. This is related to classification performance
only if the fuzzy model was built on a sufficiently general set of training
examples. The proposed algorithm represents a computationally inexpensive
tool to reduce high-dimensional input spaces as well as to gain insight into the
system through the fuzzy model. For example, it can be used to determine which
input features are exploited by fuzzy classifiers with better performance.

We believe that especially for large scale data sets in high dimensional feature 
spaces, such quick approaches to gain first insights into the nature of the data will 
become increasingly important to successfully find the underlying regularities. 
Abstract. This paper presents our first efforts toward learning simple
logical representations from robot sensory data and thus toward a
solution for the perceptual grounding problem |5]. The elements of rep- 
resentations learned by our method are states that correspond to stages 
during the robot’s experiences, and atomic propositions that describe 
the states. The states are found by an incremental hidden Markov model 
induction algorithm; the atomic propositions are immediate generaliza- 
tions of the probability distributions that characterize the states. The 
state induction algorithm is guided by the minimum description length 
criterion: the time series of the robot’s sensor values for several expe- 
riences are redescribed in terms of states and atomic propositions and 
the model that yields the shortest description (of both model and time 
series) is selected. 



1 Introduction 

We are interested in learning without supervision elements of logical representa- 
tions of episodes. The episodes in question are generated by robots interacting 
with their environments. Just as human infants bootstrap their sensorimotor ex- 
periences into a conceptual structure and language 0 , so we want our robot to 
learn ontologies and language through interaction. Previous work has focused on 
learning sensory prototypes, which represent robot interactions in terms of how 
the interactions appear to the sensors 0. For example, driving toward a wall 
and bumping into it is represented as a decreasing series of sonar values followed 
by the bump sensor going high. While sensory prototypes support some kinds 
of reasoning (e.g., predicting that the bump sensor will go high) they do not 
contain explicit elements that represent the robot, the wall, and the act of driv- 
ing; and so do not support reasoning about the roles of entities in episodes p. 
This work takes the first step from sensory prototypes to logical representations. 
Logical representations have two advantages: 

— Because they contain terms that denote the entities in a scene and the
relationships between them, logical representations such as
“push(robot, object)” are compact, and easily support planning and other
reasoning. The sensory prototype of pushing objects does not support these
easily 0.

— Abstraction can be over predicates and properties of entities, rather than
over patterns in sensory traces. For instance, the extensional category of
pushable objects is the set of elements i such that the robot has experienced
“push(robot, i)” in the past. Given the extensional category, one can imagine
learning the intensional concept of pushable object, the properties that make
objects pushable. Neither kind of categorization is feasible given only sensory
prototypes.

D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 99-fiilJ 1999.
© Springer-Verlag Berlin Heidelberg 1999

Laura Firoiu and Paul Cohen

If logical representations are so advantageous, why not build them into our 
robots, that is, make them part of the robots’ innate endowment? The rea- 
son is that we want to explain how sensorimotor activity produces thought — 
classification, abstraction, planning, language — as it does in every human infant. 
So we start with sensors and actions, and in this paper we explain how elements 
of a logical representation might be learned from these sensorimotor beginnings. 

The first step in the process of learning logical representations is to re- 
describe the episodes as state sequences. Our intuition is that experiences unfold 
through several relatively static stages. At least for simple robot activities, the 
robot’s world tends to remain in the same state over some periods of time, so we 
expect the state sequences to be simple. For example, the experience of moving 
toward an object has some well defined stages: accelerating, approaching, being 
near the object. We want to identify the states that correspond to these stages 
and ground them in patterns of sensor values. 

A technique that allows identification of states is that of hidden Markov 
model (HMM) induction. The assumption behind the HMM is that the data 
sequence is produced by a source that evolves in a state space and at each time 
step outputs a symbol according to the probability distribution of the current 
state. The states are thus characterized by stable probability distributions over 
the output alphabet. We identify the episode stages with the states of an HMM 
induced from all the data collected during a batch of episodes. Since they form 
a single vocabulary for all episodes, similar stages can be identified across expe- 
riences. 

The second step in the process of learning logical representations is to find 
atomic propositions that denote facts in the current state. Since the states found 
by the HMM are characterized by probability distributions, the atomic propo- 
sitions must be derived from them. We define the atomic propositions simply 
as disjunctions of the most likely sensor values according to these distributions. 
For example, the characterization of “accelerate” is given by some positive val- 
ues of the acceleration sensor and by the velocity sensor varying within a range 
of values. A representation of an episode becomes a sequence of states described 
by these atomic propositions. 

These representations are “passive” in the sense that, currently, they are not 
used by any problem solving system. These representations do not specify what 
to do in a certain situation or predict what will happen if an action is taken. In 
the absence of supervision and a problem solving task, we choose the principle 
of minimum description length to guide the learning process. This principle is
implemented with the help of a cost function that measures both the size of the 
representations (atomic propositions, states, episodes as state sequences) and 
how well they describe the raw data. Our algorithm identifies the states and the 
atomic propositions that heuristically minimize the cost of these descriptions. 

2 Identifying Experience Stages with Hidden Markov 
Models 

2.1 Hidden Markov Models 

A discrete hidden Markov model |0| is defined by a set of states and an alphabet 
of output symbols. Each state is characterized by two probability distributions: 
the transition distribution over states and the emission distribution over the out- 
put symbols. A random source described by such a model generates a sequence 
of output symbols as follows: at each time step the source is in one state; after 
emitting an output symbol according to the emission distribution of the current 
state, the source “jumps” to a next state according to the transition distribution 
of its current state. The activity of the source is observed indirectly, through the 
sequence of output symbols. A continuous HMM emits symbols from a continu- 
ous space, according to probability densities instead of probability distributions. 
For either discrete or continuous HMMs, efficient dynamic programming algo- 
rithms exist that: 

— induce the HMM that maximizes (locally) the probability of emitting the
given sequence (the Baum-Welch algorithm)

— find the state sequence that maximizes the probability of the given sequence, 
when the model is known (the Viterbi algorithm). 

The HMM model definition can be readily extended to the multidimensional 
case, where a vector of symbols is emitted at each step, instead of a single symbol. 
The simplifying assumption that allows this immediate extension is conditional 
independence of variables given the state. 
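
As a hedged illustration of the second algorithm listed above, the following Python sketch decodes the most likely state sequence of a small discrete HMM with the Viterbi recursion in log space. The two-state model, its probabilities, and the function name `viterbi` are illustrative assumptions, not the robot model used in this paper.

```python
import math

def viterbi(obs, init, trans, emit):
    """Most likely state sequence of a discrete HMM (log-space Viterbi).

    obs   : list of observed symbol indices
    init  : init[i]     = P(first state is i)
    trans : trans[i][j] = P(next state is j | current state is i)
    emit  : emit[i][k]  = P(symbol k | state i)
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n = len(init)
    # score[i] = best log-probability of any path ending in state i
    score = [log(init[i]) + log(emit[i][obs[0]]) for i in range(n)]
    back = []  # back[t][j] = best predecessor of state j at step t + 1
    for symbol in obs[1:]:
        prev, new_score = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + log(trans[i][j]))
            prev.append(best_i)
            new_score.append(score[best_i] + log(trans[best_i][j])
                             + log(emit[j][symbol]))
        back.append(prev)
        score = new_score
    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for prev in reversed(back):
        state = prev[state]
        path.append(state)
    return path[::-1]

# Hypothetical two-state model: state 0 mostly emits symbol 0, state 1 symbol 1.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.9, 0.1], [0.1, 0.9]]
path = viterbi([0, 0, 1, 1, 1], init, trans, emit)
```

Under the conditional independence assumption mentioned above, the multidimensional case would only change the emission term: the log-probabilities of the individual sensor symbols are summed for each state.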

2.2 Input Preprocessing 

We collected time series of sensor values from a Pioneer 1 robot. The robot has 
about forty sensors, and almost all of them return continuous values. While a 
continuous HMM appears more appropriate for this domain we chose discrete 
HMMs because our simple method of inducing atomic propositions works readily 
for probability distributions but not for probability densities. The sensor vari- 
ables are discretized independently with unidimensional Kohonen maps |3|. Each 
continuous input value is mapped to one unit and the resulting symbols are the 
map units. 
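
The discretization step can be sketched as follows. This is a minimal one-dimensional Kohonen map under assumed settings: the number of epochs, the learning rate, the Gaussian neighbourhood, and the function names `train_kohonen_1d` and `discretize` are hypothetical and are not taken from the paper.

```python
import math

def train_kohonen_1d(values, n_units=9, epochs=20, rate=0.5):
    """Train a one-dimensional Kohonen map on scalar sensor values."""
    lo, hi = min(values), max(values)
    # Start with units spread evenly over the observed range (already ordered).
    units = [lo + (hi - lo) * k / (n_units - 1) for k in range(n_units)]
    for epoch in range(epochs):
        # Learning rate and neighbourhood radius both shrink over time.
        frac = 1.0 - epoch / epochs
        alpha = rate * frac
        radius = max(n_units / 2.0 * frac, 0.5)
        for x in values:
            winner = min(range(n_units), key=lambda k: abs(units[k] - x))
            for k in range(n_units):
                h = math.exp(-((k - winner) ** 2) / (2.0 * radius ** 2))
                units[k] += alpha * h * (x - units[k])
    return units

def discretize(x, units):
    """Map a continuous value to the index of its nearest map unit."""
    return min(range(len(units)), key=lambda k: abs(units[k] - x))

# Hypothetical sensor readings, skewed toward small values.
readings = [0.1 * (i % 17) ** 2 for i in range(200)]
units = train_kohonen_1d(readings)
symbols = [discretize(x, units) for x in readings]
```

Each entry of `symbols` is a map unit index, which plays the role of the discrete output symbol for that sensor.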

Not all of the robot’s sensors are relevant to our experiments. Besides
considerably slowing down the HMM induction algorithm, the irrelevant sensors
introduce noise that leads the algorithm into creating meaningless states. We selected
the sensors that we considered important and discarded the others from the sensor
vector.

The sensor values are “jittery” and can bias the state induction algorithm 
toward frequent state changes. We correct this bias with one of our own: in a 
stable world, sensor values remain constant or change in a regular, not jittery, 
way. To introduce this bias, we create new variables by calculating the slopes 
(derivatives) of selected sensor variables and adding them to the sensor vec- 
tor. The slopes are calculated by fitting lines piecewise to the time series. The 
algorithm has two steps: 

1. Initialization: create a graph such that: 

— there is a node for each known point on the curve (time stamp); 

— there is one arc between any two distinct nodes; the arc points to the 
node with higher time stamp; the weight of the arc is the mean square 
error of the regression line fitted to the curve fragment defined by the 
two nodes (time stamps); 

2. Find the shortest path in the above graph between the nodes corresponding
to the first and last time step (Dijkstra’s algorithm).

The path calculated at step 2 defines a piecewise linear fit with the property that 
the sum over the individual fragments of the mean square error is minimized. 
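
The two steps above can be sketched in Python as follows. Taken literally, an arc between two neighbouring time stamps has zero error, so the path through every point would always be cheapest. This sketch therefore adds a fixed per-segment `penalty`, which is an assumption of the sketch and is not stated in the text. The function names and the example series are also hypothetical.

```python
import heapq

def segment_mse(ts, ys, i, j):
    """Mean square error of the least-squares line fitted to points i..j."""
    n = j - i + 1
    t, y = ts[i:j + 1], ys[i:j + 1]
    mt, my = sum(t) / n, sum(y) / n
    stt = sum((a - mt) ** 2 for a in t)
    slope = sum((a - mt) * (b - my) for a, b in zip(t, y)) / stt
    return sum((b - (my + slope * (a - mt))) ** 2 for a, b in zip(t, y)) / n

def piecewise_linear_fit(ts, ys, penalty=0.01):
    """Return breakpoint indices of the cheapest piecewise linear fit."""
    last = len(ts) - 1
    dist = {0: 0.0}
    prev = {}
    heap = [(0.0, 0)]
    while heap:
        d, i = heapq.heappop(heap)
        if i == last:
            break
        if d > dist.get(i, float("inf")):
            continue  # stale queue entry
        for j in range(i + 1, last + 1):  # arcs point forward in time
            # penalty is an assumption of this sketch (see lead-in)
            nd = d + segment_mse(ts, ys, i, j) + penalty
            if nd < dist.get(j, float("inf")):
                dist[j], prev[j] = nd, i
                heapq.heappush(heap, (nd, j))
    path = [last]
    while path[-1] != 0:
        path.append(prev[path[-1]])
    return path[::-1]

# Hypothetical series: flat for five steps, then rising with slope 2.
ts = list(range(10))
ys = [1.0] * 5 + [1.0 + 2.0 * k for k in range(1, 6)]
breaks = piecewise_linear_fit(ts, ys)
```

The slope of each fragment between consecutive breakpoints would then be used as the value of the derived variable over that fragment.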

2.3 State Splitting Algorithm for HMM Induction 

A limitation of HMM induction algorithms is that the number of states must 
be known in advance. Often, there are either too few states and the resulting 
propositions are too vague (for example a sensor can take any value) or there are 
too many states and propositions, such that the representation of experiences 
becomes long and not intelligible. Since we consider good representations to be 
“short” representations, our algorithm splits states as required to minimize the 
size of these representations, as measured¹ by a cost function. We designed the
cost function according to the minimum message length (MML) principle |SI, as 
a measure of the information needed to re-generate the original data (the time 
series of the experienced episodes). As in the MML paradigm, the robot must 
store two pieces of information. The first is its model, that is the collection of 
atomic propositions and states. The second is the encoding of each episode’s 
time series by taking advantage of the model. 

The cost function is a sum of two components: the model cost and the data 
cost. The cost of the model is a measure of the length of the model description. 
The data cost is a measure of the size of all the episode encodings. The two cost
components are presented in section 2.5.

The state induction algorithm proceeds by recursively splitting states and 
re-training the resulting HMM until the cost cannot be improved: 

¹ The cost function is not the exact length of the encoded information, but a measure
of it. For example, we ignore string delimiters or the exact number of bits when
defining the cost of encoding a number n as $\log(n)$.
1. initialization: the HMM has only one state

2. iterate while cost is decreasing: 

— for each state, compute the cost resulting from splitting the state 

— select the state that yields the largest cost reduction and split it 

State splitting stops because for the data cost to decrease, the model cost 
must increase, so the total cost cannot decrease indefinitely. By choosing to
split the state that yields the largest cost reduction at the current iteration, the 
cost is minimized heuristically. We cannot attempt to minimize the cost globally, 
because an exhaustive search of all the splitting possibilities is exponential in 
the final number of states, and the HMM fitting algorithm is guaranteed to find 
only a local maximum, anyway. 
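
The control flow of this greedy search can be sketched as follows. To keep the example self-contained, the HMM and its retraining are replaced by a toy stand-in: a "state" is a tuple of numbers, splitting cuts it at its widest gap, and the cost is one unit per state plus the squared deviation of the values from their state mean. All of these choices, and the names `greedy_split`, `splittable`, `split`, and `cost`, are assumptions made only for illustration.

```python
def greedy_split(model, splittable, split, cost):
    """Split one state at a time while the total cost keeps decreasing."""
    best = cost(model)
    while True:
        # Evaluate every possible single-state split of the current model.
        candidates = [split(model, s) for s in splittable(model)]
        if not candidates:
            return model
        challenger = min(candidates, key=cost)
        if cost(challenger) >= best:
            return model  # no single split lowers the cost
        model, best = challenger, cost(challenger)

# Toy stand-in for the HMM: a "state" is a sorted tuple of values.
def splittable(model):
    return [k for k, state in enumerate(model) if len(state) > 1]

def split(model, k):
    state = model[k]
    # Cut the state at the widest gap between neighbouring values.
    cut = max(range(1, len(state)), key=lambda i: state[i] - state[i - 1])
    return model[:k] + (state[:cut], state[cut:]) + model[k + 1:]

def cost(model):
    model_cost = 1.0 * len(model)  # grows with every new state
    data_cost = 0.0                # shrinks as states get tighter
    for state in model:
        mean = sum(state) / len(state)
        data_cost += sum((v - mean) ** 2 for v in state)
    return model_cost + data_cost

result = greedy_split(((0.0, 0.1, 0.2, 5.0, 5.1, 5.2),), splittable, split, cost)
```

In this toy run the search stops at two states, because a third state would add more to the model cost than it removes from the data cost, which mirrors the stopping argument given above.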

2.4 State Characterization with Atomic Propositions 

To characterize an HMM state by a set of logical propositions, we replace for each 
sensor the probability distributions over its values with logical descriptions of the 
distributions. These descriptions are disjunctions of the most likely sensor values, 
that is the values that have a probability higher than a certain threshold. An 
example of a proposition based on the distribution of the translational velocity 
(trans-vel) sensor is: 

distribution: 0 0 0 0 0 0.14 0.33 0.54 0

atomic proposition: trans_vel_5_6_7

In the example above, the proposition definition covers values 5 through 7. We 
consider that all the values in the proposition definition are equally likely to 
occur in a state in which the proposition holds. Thus, the proposition is defined 
as a generalization of the distribution from which it was derived to the uniform 
distribution over the covered values. This crude generalization reduces the prolif- 
eration of propositions and allows identification of common propositions across 
states. 
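
A minimal Python sketch of this characterization is given below. The threshold value of 0.1, the naming scheme for propositions, and the function names are assumptions made for illustration; the text only states that a threshold is used.

```python
def atomic_proposition(sensor, distribution, threshold=0.1):
    """Return (name, covered values) for one sensor's emission distribution."""
    # Keep the values whose probability exceeds the (assumed) threshold.
    covered = [v for v, p in enumerate(distribution) if p > threshold]
    name = sensor + "_" + "_".join(str(v) for v in covered)
    return name, covered

def generalized_distribution(distribution, covered):
    """Uniform distribution over the covered values, zero elsewhere."""
    return [1.0 / len(covered) if v in covered else 0.0
            for v in range(len(distribution))]

# The translational velocity distribution from the example above.
trans_vel = [0, 0, 0, 0, 0, 0.14, 0.33, 0.54, 0]
name, covered = atomic_proposition("trans_vel", trans_vel)
uniform = generalized_distribution(trans_vel, covered)
```

With the assumed threshold, the proposition covers values 5 through 7, and `uniform` assigns each of them probability 1/3.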

Propositions are thus simple facts of the form “sensor S takes values x or y”.
Given a sensor model that describes the kind of information a sensor returns,
we can transform these propositions into predicates. For example, if the sensor
model specifies that the translational velocity sensor returns the translational
velocity property of the constant robot, then the proposition trans_vel_5_6_7
becomes the predicate trans_vel_5_6_7(robot). We can assume that for a simple
experience, a sensor returns information about the same object throughout the
experience. Transforming propositions into predicates and then composing them
into more complex representations is the focus of our future work.

2.5 The Model and Data Encoding Costs 

The model is a set of atomic propositions and states. We encode it by concate- 
nating the descriptions of states and atomic propositions. An atomic proposition 
is described by enumerating the values it covers and its cost is: 

$\#\text{covered-values} \cdot \log(\#\text{all-sensor-values})$.
The description of a state $s_i$ has two parts. The first part specifies the codes for
next states, according to its transition probability distribution. These codes are
used in the encoding of time series² as follows: if the current state is $s_i$ and the
next state is $s_j$, then the optimal³ code for $s_j$, given $s_i$, has
$-\log(\mathrm{prob}(s_j \mid s_i))$ bits. The cost for all next state codes is
$\sum_j -\log(\mathrm{prob}(s_j \mid s_i))$. If a transition
probability $\mathrm{prob}(s_k \mid s_i)$ is 0, then we replace the k-th term in the sum with
$\log(\#\text{states})$.

The second part of a state encoding is its characterization with atomic propositions.
This cost is defined as: $\#\text{propositions-in-state} \cdot \log(\#\text{all-propositions})$.
The model cost is the sum of the costs of the descriptions of propositions and
states.
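
The model cost described in this section can be sketched in Python as follows. The sketch assumes base-2 logarithms (the text speaks of bits), takes the number of all sensor values to mean the number of discrete values of the sensor in question, and uses hypothetical function names and a hypothetical two-state model.

```python
import math

def proposition_cost(covered, n_sensor_values):
    """#covered-values * log(#all-sensor-values)."""
    return len(covered) * math.log2(n_sensor_values)

def state_cost(transition_row, n_props_in_state, n_all_props):
    """Cost of next-state codes plus cost of naming the state's propositions."""
    n_states = len(transition_row)
    # A zero transition probability is charged log(#states) instead.
    codes = sum(-math.log2(p) if p > 0 else math.log2(n_states)
                for p in transition_row)
    return codes + n_props_in_state * math.log2(n_all_props)

def model_cost(propositions, states):
    """propositions: list of (covered values, number of sensor values);
    states: list of (transition row, number of propositions in the state)."""
    n_all_props = len(propositions)
    return (sum(proposition_cost(c, n) for c, n in propositions)
            + sum(state_cost(row, k, n_all_props) for row, k in states))

# Hypothetical model: two propositions over 8-valued sensors, two states.
props = [([5, 6], 8), ([0], 8)]
states = [([0.5, 0.5], 1), ([1.0, 0.0], 1)]
total = model_cost(props, states)
```

For this toy model the propositions cost 6 and 3, the first state costs 2 for its codes plus 1 for its proposition, and the second state costs 1 plus 1.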

The time series of experiences are encoded as the most likely state sequences 
in the induced HMM. A state specifies the set of atomic propositions that hold 
in the state and these propositions carry information about the sensor values. 
The propositions generalize over the distributions from which they were derived 
and lose information that was present in the distributions. Consequently, the 
propositions may be inaccurate, meaning they specify incorrect sensor values, 
or imprecise, meaning they specify a range of sensor values. For example, if a 
propositional characterization of an HMM state says “translational velocity is 
2, 3, or 4.” and the robot’s translational velocity in the state is actually 5, then 
the proposition is inaccurate. If translational velocity is 3, then the proposition 
is imprecise. 

To re-generate a time series of sensor values from logical state descriptions, 
one would have to store additional information, either for specifying one of the 
covered values when the proposition is not precise, or for correcting errors when 
the proposition does not hold at that time step. The cost of an individual experi- 
ence is defined to include both the size of its encoding as a state sequence within 
the model and the additional information required for correcting the description, 
if necessary. Specifically, the cost is a sum over all time steps of: 

1. the length of encoding with the optimal code the current state $s(t)$, given
the previous state $s(t-1)$; as discussed above, this cost is either
$-\log(\mathrm{prob}(s(t) \mid s(t-1)))$ or $\log(\#\text{states})$.

2. the length of encoding the sensor vector at the current time step, given the
current state; for each sensor this component of the cost is either

$\log(\#\text{covered-values})$ if the proposition is imprecise, or
$\log(\#\text{all-sensor-values})$ if the proposition is inaccurate
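
The two components above can be combined into a per-episode cost, sketched below in Python. The sketch assumes base-2 logarithms, does not charge a transition cost for the first time step (the text does not say how the first state is encoded), and uses hypothetical names and data throughout.

```python
import math

def episode_cost(states, sensor_vectors, trans, props, n_values):
    """Description cost of one episode given its most likely state sequence.

    states         : state index at each time step
    sensor_vectors : one tuple of discretized sensor values per time step
    trans          : trans[i][j] = prob(s_j | s_i)
    props          : props[state][sensor] = set of covered values
    n_values       : n_values[sensor] = number of discrete sensor values
    """
    n_states = len(trans)
    cost = 0.0
    for t, (s, vector) in enumerate(zip(states, sensor_vectors)):
        if t > 0:  # assumption: the first state is not charged
            p = trans[states[t - 1]][s]
            cost += -math.log2(p) if p > 0 else math.log2(n_states)
        for sensor, value in enumerate(vector):
            covered = props[s][sensor]
            if value in covered:   # proposition holds but may be imprecise
                cost += math.log2(len(covered))
            else:                  # proposition is inaccurate
                cost += math.log2(n_values[sensor])
    return cost

# Hypothetical model with one 8-valued sensor and two states.
trans = [[0.5, 0.5], [0.0, 1.0]]
props = [[{2, 3, 4}], [{7}]]
cost = episode_cost([0, 0, 1], [(3,), (5,), (7,)], trans, props, [8])
```

In this example the first step pays for an imprecise proposition, the second for a transition and an inaccurate proposition, and the third only for a transition, since a proposition covering a single value costs nothing when it holds.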



² Although these codes appear in the redescription of experiences, they must be
specified in the model description because otherwise the encoded experiences cannot be
decoded.

³ We do not have to specify what this optimal code is. For our cost function, we need to
know only its length.
3 Experiment

3.1 Experiment Setting 

Sensor value time series were collected from twelve simple experiences of the
Pioneer 1 robot. The experiences fall into four categories: pass object on right,
pass object on left, push object and approach object. There are three experiences
of each kind. For all experiences, the object is perceived by the visual channel
A, which was calibrated to detect blue objects. The perceived object will be
referred to from now on as “object A”. The data were collected⁴ in a less noisy
environment, with the robot executing forward motions along an almost empty
corridor. The noise reduction proved to be beneficial: no spurious objects, which
usually mislead the state splitting algorithm, were detected in the visual field.

From the forty or so sensors of the Pioneer 1 robot, we selected six that we 
consider relevant for describing the twelve experiences in our experiment. These 
are: 



— “trans-vel” is the robot’s translational velocity 

— “vis-A-area” is the area occupied by the object in the channel A visual field 

— “vis-A-x” and “vis-A-y” are the coordinates in the visual field of object A 

— “grip-front-beam” and “grip-rear-beam” return 1 when an object is between 
the two gripper arms and 0 otherwise 

The slopes (derivatives) of the first four sensors were also added, yielding 
four more variables: “trans-acc” is the derivative of translational velocity and 
“diff-vis-A-xxx” are the derivatives of the visual sensors. 

The sensor values were discretized with Kohonen maps, one unidimensional
map for each sensor variable. Figure 1 shows the resulting discretization for
the visual A area sensor. As can be seen in this figure, the map is topologically
ordered, that is value(map unit 0) < value(map unit 1) < ... <
value(map unit 8). Topological ordering is a property of unidimensional Kohonen
maps, so the maps of all sensors are ordered. Due to this property, we can
easily interpret the atomic propositions. For example, the atomic proposition
“vis-A-area.0-1” tells us that a small object is seen in visual channel A, while
“vis-A-area.7-8” signals the presence of a large object.



3.2 Results 

The results of the state splitting algorithm for two of the twelve experiences, and
the corresponding partitioning into stages, are shown in Fig. 2. For the states that
occur during these two experiences, Table 2 lists their probability distributions
over sensor values and the induced atomic propositions. The most likely HMM
state sequences for all experiences are shown in Table 1.

⁴ We thank Zack Rubinstein for providing the data and for collecting them in a less
noisy environment.
Fig. 1. Discretization of the “vis-A-area” sensor with a linear Kohonen map.
The map has 9 units, u0 through u8, thus yielding 9 discrete symbols. The plot
shows the values of the map units and the approximate intervals of sensor
values allocated to each unit. Note that most of the sensor values
are mapped to the first three units, while the last two units get only one value
each.

Table 1. The middle column shows the most likely HMM state sequences for the 
twelve experiences and the right column shows their corresponding compressed 
stage sequences. In the compressed stage sequences, Ci stands for a composite 
stage and Si stands for a simple stage. 
We can see from Figure 2 and from Table 1 that we can indeed identify a
contiguous run of one HMM state, s_i, with an experience stage; call it S_i.
Furthermore, some pairs of stages, for example (S1 S5), appear quite frequently. Such
frequent pairs can be merged into composite stages. By replacing subsequences
of simple stages with composite stages, even simpler redescriptions of
experiences are obtained. In order to explore this possibility, we implemented a
simple compression algorithm that creates composite stages, guided again by the
minimum description length principle. The description that must be minimized
has two parts: the description of composite stages in terms of simple stages and
the redescription of each individual experience with both composite and simple
stages. The cost of each part is a measure of its description length. Creating a
new composite stage has the effect that the cost of the first part increases, while
that of the second part decreases. Therefore, the total cost, which is the sum of
Fig. 2. HMM state fragmentation for two experiences: the left column contains
the plots from a “push A” move and the right column from an “approach A” 
move. The units on the x-axis are time steps and the units on the y-axis are 
discretized sensor values. 
Table 2. The states and atomic propositions that occur in the two experiences in
Fig. 2. The states are listed in their order of appearance: (s3 s1 s5 s4 s6) for
“approach A”, and (s3 s1 s5 s4 s6 s7) for “push A”. An atomic proposition like
vis-A-x.3.5.6 means that the “vis-A-x” sensor mostly takes values from the set
{3, 5, 6}, while “vis-A-area.5-8” means that the “vis-A-area” sensor takes values
in the range 5 ... 8.

state  atomic proposition   probability distribution             interpretation

s3     trans-vel.0-8        .38 .01 .01 .03 .01 .01 .18 .28 .09  either accelerated or constant move
       trans-acc.0.1.5.6    .06 .81 .00 .00 .00 .06 .06 .00 .00
       grip-front-beam.0    1.0 .00                              no object within gripper arms
       grip-rear-beam.0     1.0 .00
       vis-A-area.0         1.0 .00 .00 .00 .00 .00 .00 .00 .00  very small object in the lower central
       vis-A-x.4            .00 .00 .00 .00 1.0 .00 .00 .00 .00  region of the visual field
       vis-A-y.2            .00 .00 1.0 .00 .00 .00 .00 .00 .00

s1     trans-vel.0.1.6.7    .24 .05 .00 .00 .00 .00 .46 .24 .00  mostly constant move at high speed
       trans-acc.1-2        .00 .90 .10 .00 .00 .00 .00 .00 .00
       grip-front-beam.0    1.0 .00                              no object within gripper arms
       grip-rear-beam.0     1.0 .00
       vis-A-area.0-1       .10 .90 .00 .00 .00 .00 .00 .00 .00  small object in the lower central
       vis-A-x.2-6          .00 .00 .39 .02 .34 .17 .07 .00 .00  region of the visual field
       vis-A-y.2-3          .00 .00 .24 .76 .00 .00 .00 .00 .00

s5     trans-vel.5-8        .00 .00 .00 .00 .00 .05 .50 .35 .10  constant move at high speed
       trans-acc.1.8        .00 1.0 .00 .00 .00 .00 .00 .00 .00
       grip-front-beam.0    1.0 .00                              no object within gripper arms
       grip-rear-beam.0     1.0 .00
       vis-A-area.1         .00 1.0 .00 .00 .00 .00 .00 .00 .00  small object somewhere in the lower
       vis-A-x.1.2.4.5.6.7  .00 .02 .30 .00 .40 .04 .20 .04 .00  region of the visual field
       vis-A-y.3-4          .00 .00 .00 .78 .22 .00 .00 .00 .00

s4     trans-vel.5-8        .00 .00 .00 .00 .00 .16 .09 .63 .12  constant move at high speed
       trans-acc.1          .00 1.0 .00 .00 .00 .00 .00 .00 .00
       grip-front-beam.0    1.0 .00                              no object within gripper arms
       grip-rear-beam.0     1.0 .00
       vis-A-area.2-3       .00 .00 .70 .30 .00 .00 .00 .00 .00  small object in the central
       vis-A-x.1.4.6        .00 .23 .00 .00 .58 .00 .19 .00 .00  region of the visual field
       vis-A-y.3-5          .00 .00 .00 .14 .70 .16 .00 .00 .00

s6     trans-vel.5-8        .00 .00 .00 .00 .00 .05 .50 .35 .10  constant move at high speed
       trans-acc.1          .00 1.0 .00 .00 .00 .00 .00 .00 .00
       grip-front-beam.0    1.0 .00                              no object within gripper arms
       grip-rear-beam.0     1.0 .00
       vis-A-area.4-6       .00 .00 .00 .00 .45 .45 .10 .00 .00  medium sized object in the upper central
       vis-A-x.4-5          .00 .00 .00 .00 .65 .35 .00 .00 .00  region of the visual field
       vis-A-y.5-6          .00 .00 .00 .00 .00 .60 .40 .00 .00

s7     trans-vel.5-7        .00 .00 .00 .00 .00 .12 .25 .62 .00  constant move at high speed
       trans-acc.1          .00 1.0 .00 .00 .00 .00 .00 .00 .00
       grip-front-beam.1    .00 1.0                              object present within gripper arms
       grip-rear-beam.0-1   .38 .62
       vis-A-area.5-8       .00 .00 .00 .00 .00 .38 .38 .12 .12  large object in the upper central
       vis-A-x.3.5.6        .00 .00 .00 .25 .00 .13 .62 .00 .00  region of the visual field
       vis-A-y.7-8          .00 .00 .00 .00 .00 .00 .00 .25 .75
the costs of the two parts, may decrease as the result of creating a new composite
stage. The algorithm continues to merge stages greedily, until the total cost
stops decreasing. The results are shown in Table 1, in the column “compressed
stage sequence”.
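
The greedy merging loop can be sketched as follows. This is only an illustration under stated assumptions: the description length is approximated by counting symbols (one symbol per stage in a redescribed experience, three symbols per composite stage definition), and the pair to merge is simply the most frequent adjacent pair. The cost measure, the function names, and the toy stage sequences are all hypothetical and are not taken from the implementation described above.

```python
from collections import Counter


def description_length(composites, sequences):
    """Two-part cost, measured in symbols (an assumed, simplified measure)."""
    # Part 1: each composite stage costs its own name plus its two constituents.
    dictionary_cost = sum(1 + len(parts) for parts in composites.values())
    # Part 2: each experience costs one symbol per stage in its redescription.
    sequence_cost = sum(len(seq) for seq in sequences)
    return dictionary_cost + sequence_cost


def replace_pair(seq, pair, name):
    """Replace non-overlapping occurrences of `pair` in `seq` by `name`."""
    out, i = [], 0
    while i < len(seq):
        if tuple(seq[i:i + 2]) == pair:
            out.append(name)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def compress(sequences):
    """Greedily merge the most frequent adjacent pair while the cost decreases."""
    sequences = [list(seq) for seq in sequences]
    composites = {}
    while True:
        pair_counts = Counter(p for seq in sequences for p in zip(seq, seq[1:]))
        if not pair_counts:
            break
        pair = max(pair_counts, key=pair_counts.get)
        name = f"C{len(composites) + 1}"
        new_composites = {**composites, name: pair}
        new_sequences = [replace_pair(seq, pair, name) for seq in sequences]
        old_cost = description_length(composites, sequences)
        if description_length(new_composites, new_sequences) >= old_cost:
            break  # the total cost stopped decreasing
        composites, sequences = new_composites, new_sequences
    return composites, sequences


# Toy data: three "approach" and three "push" experiences (stage names only).
approach = ["S3", "S1", "S5", "S4", "S6"]
push = approach + ["S7"]
composites, compressed = compress([approach] * 3 + [push] * 3)
```

On this toy input the shared prefix collapses into a single composite stage, while the trailing S7 of the "push" sequences stays a simple stage because merging it would not reduce the total cost. The composite names produced here are arbitrary and do not correspond to the labels in Table 1.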

It can be noticed in Table 1 that every “approach-A” experience is described
by one composite stage, C12, and every “push-A” experience is described by
the same C12, followed by the simple stage S7. It can be seen in Table 2 that
state s7, which defines stage S7, is the only one to be characterized by the
atomic propositions “grip-front-beam.1” and “grip-rear-beam.0-1”. These two
propositions tell us that the robot is in “contact” with an object (the object
is within the gripper arms). While it is obvious to us that “contact” is the
difference between an “approach” and a “push”, the algorithm does not get
explicit information about the differences between experiences, and does not
have the explicit goal of finding them. It is interesting, then, that the minimum
description length principle led to a re-representation of experiences that makes
this distinction apparent.
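
The proposition names in Table 2 appear to follow a simple convention: the suffix lists the discrete values to which the state assigns visible probability, written as a range when those values are contiguous and as a dotted list otherwise. The sketch below encodes that reading. The rule is inferred from the table rather than stated in the text, it matches most but not necessarily all entries, and the threshold eps and the function name are assumptions.

```python
def atomic_proposition(sensor, probs, eps=0.005):
    """Name a proposition after the values that carry visible probability.

    `probs[i]` is the probability a state assigns to discrete value i.
    Values with probability <= eps are treated as absent (assumed threshold).
    """
    support = [i for i, p in enumerate(probs) if p > eps]
    if not support:
        raise ValueError("distribution has no value above the threshold")
    if len(support) == 1:
        suffix = str(support[0])
    elif support == list(range(support[0], support[-1] + 1)):
        suffix = f"{support[0]}-{support[-1]}"  # contiguous values: a range
    else:
        suffix = ".".join(str(i) for i in support)  # scattered values: a set
    return f"{sensor}.{suffix}"


# Distributions of state s7, copied from Table 2.
front = atomic_proposition("grip-front-beam", [.00, 1.0])
rear = atomic_proposition("grip-rear-beam", [.38, .62])
vis_x = atomic_proposition("vis-A-x", [.00, .00, .00, .25, .00, .13, .62, .00, .00])
```

Read this way, the "contact" distinction discussed above amounts to the gripper beam distributions of s7 placing their mass on value 1 instead of value 0.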

It can also be noticed that the stage sequences allow a good clustering of
experiences: the first two “pass-right-A” experiences share the same stage sequence,
and so do all the “push-A” and, respectively, all the “approach-A” experiences.

While the above remarks are encouraging for the validity of our approach
(applying the minimum description length criterion for inducing meaningful
elements of representation), we can see in Fig. 2 that our algorithm fails to
identify the acceleration stage for the two presented experiences. Although the
first state in the sequence, s3, is the only one that assigns nonzero probabilities
to high acceleration values, the transition to the next state, s1, is not triggered
by the change in the acceleration regime, but by the change in the “vis-A-area”
sensor from value 0 to 1. As a matter of fact, for
both experiences, there are other state changes triggered by this sensor as well:
s5 → s4 occurs when “vis-A-area” becomes 2 and s4 → s6 when “vis-A-area”
becomes 4. This indicates that the partitioning of these experiences into stages
is mostly determined by the visual area of the object, and that the stages are
identified with different degrees of closeness to the object. While this partitioning
is not meaningless, it does not distinguish the important acceleration stage. The
main reason is that the algorithm has no measure of the relative importance of
the sensor variables, other than the reduction in description length obtained by
distinguishing states based on their values. Another reason is that, as discussed
earlier, the cost of the description cannot be minimized globally.

4 Conclusions and Future Work 

During the first year of an infant’s life, she apparently develops increasingly rich 
and efficient representations of her environment (Mandler calls this process
re-description). We have shown how to re-describe multivariate time series of
sensor values as rudimentary logical descriptions, by creating new objects that 
are associated with parts of the world at different abstraction levels. The objects 
at one level are grounded in, or mapped to, objects at the previous level. Because 
both memory and time are finite resources, the criterion of simple (short) descriptions
must govern the process. In this work we tried to apply these ideas at the
lowest levels of abstraction, by creating atomic propositions grounded in
probability distributions over raw sensor values (physical level). The fragmentation
of time series into states and their corresponding propositional characterizations 
often appear to agree with our interpretation of the evolution of experiences. 
But this fragmentation is not perfect: for example, as discussed in the previ- 
ous section, there is no distinct “acceleration” stage, because the algorithm has 
no information that the acceleration sensor is more “important” than others. 
Meaningful representations must be not only simple, but also useful. We
consider useful the elements of representations that predict the outcome of an 
experience, predict when a state change occurs, explain the differences between 
experiences or explain reward. Our next goal is to define the utility criterion for 
representation elements and redesign the learning algorithm to incorporate both 
the utility and the minimum description length criteria. 
Abstract. Regularities exist in datasets describing spatially distributed
physical phenomena. Human experts often understand and verbalize the 
regularities as abstract spatial objects evolving coherently and interact- 
ing with each other in the domain space. We describe a novel com- 
putational approach for identifying and extracting these abstract spa- 
tial objects through the construction of a hierarchy of spatial relations. 
We demonstrate the approach with an application to finding troughs in 
weather data sets. 



1 Introduction 

In analyzing spatial datasets such as weather data or fluid motion, experts of- 
ten perceive and reason about these physical fields in terms of abstract spatial 
objects (often described by the so-called features or patterns) that evolve and 
interact with each other. The benefits of doing this are at least twofold: (1) The
fields are labeled by aggregate properties describing the macroscopic behaviors
of the underlying phenomena so that the fields can be understood and manipulated
on a scale more abstract than the point-wise description. (2) Just like
the real-world phenomena they represent, the perceived objects are generally 
persistent both spatially and temporally and hence allow experts to understand 
them intuitively using common sense. 

For example, when analyzing weather data, meteorologists can perceive aggregate
weather features such as high/low pressure centers, pressure troughs, thermal
packings, fronts and jet streams and label them explicitly on the weather
charts, as shown in Fig. 1. The experts then use weather rules to correlate these
features and establish prediction patterns. Here is a sample of weather prediction 
rules [TT| : 

— “At 850mb, the polar front is located parallel to and on the warm side of 
the thermal packing.” 

— “Major and minor 500mb troughs are good indicators of existing or potential 
adverse weather.” 

— “Lows tend to stack toward colder air aloft while highs tend to stack toward 
warmer air aloft.” 



D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 111-^23 1999. 
(c) Springer- Verlag Berlin Heidelberg 1999 
Fig. 1. Weather Features: (a) a thermal packing; (b) a pressure trough, where
solid lines represent iso-bars and the dashed line the trough position; (c) highs “H”
and lows “L” in a pressure map.



Modeling the abstraction and reasoning processes of scientists in these appli- 
cation domains is a goal of qualitative reasoning and common sense reasoning. 
The abstraction step extracts salient descriptions as spatial objects from raw spa- 
tial datasets. Without including this step, artificial systems that reason about 
physical systems will have to rely on domain experts to provide basic inputs. 
In the terminology of MD/PV in qualitative reasoning, this step corresponds
to building a place vocabulary (i.e., abstract spatial objects) from a metric diagram
(i.e., spatial datasets sampling the underlying physical fields).

The object-centered ontologies in qualitative reasoning assume the existence 
of base objects in the first place. Lundell’s work deals with the physical fields, 
but the objects studied are only iso-clusters — sets of connected points with 
qualitatively equivalent parameter values [7]. Iso-clusters are easy to compute
but inadequate for modeling more complex spatial objects, such as troughs and 
fronts in the weather analysis domain. 

The spatial objects perceived by scientists share a common characteristic:
they are all visually salient, at least to the trained eye. On the other hand,
these objects often do not admit a well-defined mathematical characterization 
with a small number of parameters. The identification and extraction of spatial 
objects share many similarities with the recognition and figure-ground separation 
problems in computer vision. 

In this paper we develop a general approach to extracting abstract spatial ob- 
jects from spatial datasets. It is built upon the framework of Spatial Aggregation 
(SA) that features a multi-layer representation and a set of generic operators to 
transform the representation. This work extends the SA framework in three
ways: 

— It emphasizes the importance of internal structural information about spatial 
objects. A spatial object is not just a collection of constituent objects but also has
a rich internal structure that may influence the aggregate properties of
the object including its identity. 

— It classifies neighborhood relations into strong adjacencies and weak adja- 
cencies. This classification explicates the connection between structural in- 
formation and neighborhood relations and the connection between low-level 
and high-level neighborhood relations, and enables high-level neighborhood
relations to be built from primitive ones. 

— It provides an algorithm for extracting structured spatial objects and extends 
the set of generic SA operators. 

We have applied this approach to the trough extraction problem in weather 
analysis and obtained promising results. 

The rest of the paper is organized as follows. Section 2 presents the proposed 
approach. Section 3 describes the trough application. The experimental results 
are shown in Section 4 and the related work is discussed in section 5. 

2 Extracting Structured Spatial Objects 

In this section, we first briefly review the Spatial Aggregation framework. We 
then introduce the notion of strong adjacency and weak adjacency as a refine- 
ment of neighborhood relations and discuss its use in aggregating objects and 
their neighborhood relations. Last we present a structure finding algorithm that 
incrementally extracts structured objects from spatial datasets. 



2.1 Spatial Aggregation 

Spatial Aggregation is a recently developed computational approach to hierarchical
data analysis. It has been successfully applied to several difficult data analysis
and control problems such as interpretation of numerical experiments,
kinematics analysis of mechanisms, design of controllers, reasoning about
fluid motion, and distributed control optimization.

SA features a multi-layer representation and a set of generic operators to 
transform spatial objects at a finer scale into ones at more abstract levels. The 
lowest level usually consists of the simplest, point-like spatial objects such as the 
image pixels in the raw data. From there on, domain knowledge is integrated into 
the generic operators to select salient objects, build appropriate neighborhood 
relations upon the objects, and aggregate them into more complex objects at 
the next higher level. After the aggregation, the newly formed higher level spa- 
tial objects generally contain richer domain-specific descriptions and aggregate 
spatial properties than their constituent spatial objects. These properties make 
the qualitative patterns more explicit and support the extraction of macroscopic 
behaviors. Details of the SA algorithm are described elsewhere.



2.2 Strong Adjacencies and Weak Adjacencies 

Neighborhood relation is essential in spatial object aggregation because, intu- 
itively, a coherent spatial object has to be internally connected. We observe 
that neighborhood relations play two distinct roles in the aggregation process. 
Some bind a set of spatial objects into a single aggregate object and become 
“intra-relations” within the aggregate object after the aggregation. We call these 
neighborhood relations strong adjacencies. The others do not bind objects at the
current level but stand out as “inter-relations” to reveal the spatial relations be- 
tween the aggregate spatial objects after the aggregation. We call those weak 
adjacencies. 

With the finer classification of neighborhood relations into strong and weak 
adjacencies, we then introduce a two-step classification for building increasingly 
more abstract objects: each connected component in the graph defined by the 
strong adjacencies is re-described into a higher level spatial object; the weak ad- 
jacencies are aggregated and summarized to build neighborhood relations among 
aggregate spatial objects. This classification is especially useful, compared to the 
original SA algorithm, when it is inappropriate to simplify and cluster aggregate 
spatial objects as points. 
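
The first of the two steps, re-describing each connected component of the strong-adjacency graph as one higher level object, can be sketched with a small union-find structure. The function name and the data layout (objects as hashable identifiers, strong adjacencies as pairs) are assumptions made for the illustration and are not the SAL interface.

```python
def redescribe(objects, strong_adjacencies):
    """Group objects into the connected components of the strong-adjacency graph."""
    parent = {obj: obj for obj in objects}

    def find(x):
        # Follow parent links to the component representative, halving paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in strong_adjacencies:
        parent[find(u)] = find(v)  # merge the two components

    components = {}
    for obj in objects:
        components.setdefault(find(obj), set()).add(obj)
    return list(components.values())


# Five low-level objects; strong adjacencies bind {1, 2, 3} and {4, 5}.
aggregates = redescribe([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

Each returned set is one aggregate object whose internal structure is given by the strong adjacencies among its members. Any adjacency that was classified as weak is left for the second step.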

2.3 A Structure Finding Algorithm 

Based on the adjacency classification, we present a structure finding algorithm 
for spatial objects, as illustrated in Fig. 2. At each level, the neighborhood
relations are classified into strong adjacencies and weak adjacencies. Then the
strong adjacencies and the spatial objects they bind together are re-described 




Fig. 2. The structure finding algorithm for one layer of aggregation. The output 
is fed into the next layer of aggregation that has the identical computational 
structure. Rectangles denote operators and ovals data. 



into higher level spatial objects with the strong adjacencies serving as the inter- 
nal structure. The weak adjacencies are then aggregated to build neighborhood 
relations between higher level spatial objects. Therefore, at the next level, with 
the internal structure of objects abstracted away and the relations among ob- 
jects simplified by aggregation, the aggregate properties of the objects become 
more prominent. When object details are requested, the internal structure of the
object is available at the lower level so that relevant information can be quickly
located. In summary, the algorithm is capable of not only explicating structures 
from data, but also organizing information in a structured, hierarchical way. 

A new SA operator, adjacency-aggregate, is introduced to support the aggregation
of adjacencies with the following syntax:

— adjacency-aggregate: agg-objs * weak-adjacencies * constr-op → N-relations

It takes a collection of aggregate objects, the weak adjacencies among the con- 
stituents, and a constructor operator as inputs, and produces a set of neigh- 
borhood relations for the aggregate objects. The constructor operator constr-op 
constructs a neighborhood relation of two aggregate objects using all the pairwise 
weak adjacencies between their respective constituents. Different constructor op- 
erators may be employed according to the task requirements. The constructor 
operators currently supported by SAL are: 

— count: return the number of weak adjacencies as the data value for the new 
N-relation. 

— minimal: return the minimal weak adjacency as the data value for the new 
N-relation. 

— maximum: return the maximum weak adjacency as the data value of the 
new N-relation. 

— pack: return the set of weak adjacencies as the data value for the new N- 
relation. 

Users can also supply other constructor operators. Fig. 3 demonstrates the
adjacency-aggregate operation using the count constructor operator.
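
The behavior of the operator and its four constructors can be mimicked in a few lines. This is a sketch, not the SAL implementation: aggregate objects are represented as a dictionary from a name to its set of constituents, each weak adjacency is assumed to carry a numeric data value (for example a distance), and all Python names are hypothetical.

```python
from collections import defaultdict

# The four constructor operators described above, applied to the list of data
# values of all weak adjacencies between two aggregate objects.
CONSTRUCTORS = {
    "count": len,
    "minimal": min,
    "maximum": max,
    "pack": list,
}


def adjacency_aggregate(agg_objs, weak_adjacencies, constr_op):
    """Build N-relations between aggregate objects from weak adjacencies.

    agg_objs: dict mapping an aggregate name to the set of its constituents.
    weak_adjacencies: iterable of (constituent_u, constituent_v, value).
    constr_op: function turning a list of values into one N-relation value.
    """
    owner = {c: name for name, members in agg_objs.items() for c in members}
    grouped = defaultdict(list)
    for u, v, value in weak_adjacencies:
        a, b = owner.get(u), owner.get(v)
        if a is None or b is None or a == b:
            continue  # ignore unknown constituents and intra-object relations
        grouped[tuple(sorted((a, b)))].append(value)
    return {pair: constr_op(values) for pair, values in grouped.items()}


agg_objs = {"A": {"a1", "a2"}, "B": {"b1", "b2"}, "C": {"c1"}}
weak = [("a1", "b1", 2.0), ("a2", "b2", 1.5), ("b2", "c1", 3.0)]
counts = adjacency_aggregate(agg_objs, weak, CONSTRUCTORS["count"])
closest = adjacency_aggregate(agg_objs, weak, CONSTRUCTORS["minimal"])
```

Passing the constructor as a function keeps the operator generic, so a user-supplied constructor is just another callable.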




Fig. 3. Adjacency-aggregate using the constructor count: (a) before
adjacency-aggregation; (b) after adjacency-aggregation. A, B, C, D: aggregate
objects. Dashed lines in (a): weak adjacencies. Bold lines in (b): computed
N-relations.



We have introduced the concept of structured spatial objects, strong and 
weak adjacencies, and the structure finding algorithm. This approach supports 
the extraction of a hierarchy of structured spatial objects by iteratively aggregating
the lowest level spatial objects and neighborhood relations. The original
SAL library is extended to support the structure aggregation approach. 

3 Application: Extracting Trough Structures from 
Weather Data 

In this section, we use the weather application to demonstrate how the proposed 
approach can be used to extract abstract spatial objects from spatial datasets. 

Troughs and ridges are important features in weather analysis. High-altitude 
troughs show the bending of the jet streams in the high-altitude air circulation 
and are important for extended weather forecast. Surface troughs are usually 
closely related to fronts and hence are useful for locating the fronts. 

What are troughs and ridges? Visually, troughs and ridges are stacks of iso-bar
segments bending consistently in one direction, with troughs corresponding
to bendings pointing away from lower iso-bars and ridges away from higher
iso-bars. Fig. 1(b) shows a trough. Due to the Coriolis force, winds tend to follow
iso-bars. So the bending of iso-bars is an indication of a sharp change in wind
direction, which usually brings more advection and causes more mixing of warm air
with cold air, and therefore deteriorating weather.

Though the extraction of troughs seems effortless and immediate to human
eyes, it is only qualitatively understood. Sometimes even experts may give different
answers about the existence of a trough in a weather map because of their
different “mind-judgment” criteria.

Intuitively, experts first observe the high bending segments of iso-curves in 
the chart and then extract the linear structures from these bending segments. 
Our trough finding algorithm emulates this process by first extracting the high 
bending segments of iso-bars, establishing neighborhood relations between these 
segments, and finally using these neighborhood relations to extract the linear 
structures among the segments to obtain troughs. This algorithm is a special 
instance of the earlier structure aggregation approach where aggregation param- 
eters are chosen using domain knowledge in weather analysis. 

An alternative approach to extracting troughs would be to find the high 
curvature points on the iso-bar curves and then extract the trough structures 
from these high curvature points. Since the high bending curve segments are more 
robust features than high curvature points, the trough structure built from the 
neighborhood relations of high bending curve segments is also more robust. 

The complete algorithm has a pre-processing step, two levels of aggregation 
and a post-processing step: 

— Pre-processing: Extract all the iso-points at the required contour levels. 

— Level I aggregation: 

• Aggregate: Build neighborhood relations upon the iso-points using De- 
launay triangulation. 

• Classify: Classify adjacency relations into strong adjacencies and weak 
adjacencies. 
• Re-describe: Use strong adjacencies to aggregate iso-points into iso-curves.

• Filter: Extract salient objects, i.e., segment iso-curves and extract high 
bending curve segments. 

— Level II aggregation: 

• Adjacency-aggregate: Build neighborhood relations upon high bending
curve segments by aggregating weak adjacencies. 

• Classify: Classify the neighborhood relations of curve segments into 
strong adjacencies and weak adjacencies. 

• Re-describe: Use strong adjacencies to aggregate curve segments into
trough structures. Trim the trough structures into linear structures. 

— Post-processing: Locate and draw the trough line position as in the standard 
weather analysis chart. 

The SA operators aggregate, classify, re-describe, filter and adjacency-aggregate
are used in the above two-level aggregations. The filtering of curve segments
at level I involves choosing an appropriate threshold for segmentation. We will 
discuss next the technical issues in computing a stable segmentation. 

3.1 Curve Segmentation 

We use a curve fitting technique and the split and merge algorithm to segment
curves. Because iso-bar curves are generally smooth, we use constant curvature
curves (straight lines and circular arcs) to fit iso-bar curves, i.e., we first
transform an iso-bar curve into ψ-S space (ψ: the angle made between the tangent
to the curve and a fixed line; S: the length of the curve from the beginning), where
constant curvature curves become line segments, and then use piecewise linear
approximation to fit the transformed curve.
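
The transformation into ψ-S space can be illustrated for a curve given as a polyline. In the sketch below each edge contributes one sample: the arc length at which the edge starts and the edge's tangent angle, unwrapped so that ψ does not jump by 2π. The sampling scheme and the function name are choices made for this example only.

```python
import math


def psi_s(points):
    """Map a polyline to a list of (S, psi) samples, one per edge."""
    samples = []
    arc_length = 0.0
    psi = None
    previous_heading = None
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        heading = math.atan2(y1 - y0, x1 - x0)
        if previous_heading is None:
            psi = heading
        else:
            # Unwrap: take the turning angle in [-pi, pi) and accumulate it,
            # so psi stays continuous across the +/- pi cut of atan2.
            turn = (heading - previous_heading + math.pi) % (2 * math.pi) - math.pi
            psi += turn
        samples.append((arc_length, psi))
        arc_length += math.hypot(x1 - x0, y1 - y0)
        previous_heading = heading
    return samples


# A straight line and a half circle sampled every 10 degrees.
line = psi_s([(0, 0), (1, 1), (2, 2), (3, 3)])
arc = psi_s([(math.cos(math.radians(a)), math.sin(math.radians(a)))
             for a in range(0, 181, 10)])
```

For the straight line ψ is constant, and for the circular arc ψ grows by the same amount for every equal step in S, which is why constant curvature curves become line segments in this space.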

The split and merge algorithm requires an error threshold for segmentation if the
desired number of segments of a curve is unknown. The algorithm generates a
segmentation that satisfies the error threshold constraint and seeks to minimize
the number of segments, and to minimize the approximation error if the number of
segments cannot be reduced any further. Because the shape of iso-bar curves
varies over a large range, appropriate thresholds need to be selected according to
the inherent properties of the underlying curves. We have developed the iterative
thresholding technique to find appropriate thresholds for satisfactory segmentations
of iso-bar curves. This technique generates a sequence of thresholds and
uses one of two heuristics, ATS (Absolute Threshold Stability) or RTS (Relative
Threshold Stability), to choose the most stable one from them. Because of
the space limit, we are unable to go into the details of the iterative thresholding
technique here. Interested readers are referred to our technical report. Fig. 4
shows two sample outputs of our segmentation algorithm.
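
To make the role of the error threshold concrete, the sketch below implements only the split phase of a piecewise linear fit on (S, ψ) samples: a piece is split at its worst-fitting sample until every piece deviates from its chord by at most the threshold. The merge phase and the iterative thresholding heuristics (ATS, RTS) are not reproduced here, and the function name and sample data are hypothetical.

```python
def split_segments(samples, threshold):
    """Split (S, psi) samples into pieces that each fit their chord.

    `samples` must be sorted by strictly increasing S. Returns a list of
    (first_index, last_index) pairs; consecutive pieces share an end sample.
    """

    def worst_deviation(lo, hi):
        s0, p0 = samples[lo]
        s1, p1 = samples[hi]
        worst, worst_index = 0.0, lo
        for i in range(lo + 1, hi):
            s, p = samples[i]
            t = (s - s0) / (s1 - s0)
            deviation = abs(p - (p0 + t * (p1 - p0)))
            if deviation > worst:
                worst, worst_index = deviation, i
        return worst, worst_index

    def split(lo, hi):
        worst, index = worst_deviation(lo, hi)
        if worst <= threshold:
            return [(lo, hi)]
        return split(lo, index) + split(index, hi)

    return split(0, len(samples) - 1)


# psi is flat for S <= 5 (a straight part) and then rises linearly (an arc).
samples = [(float(s), 0.0 if s <= 5 else float(s - 5)) for s in range(11)]
tight = split_segments(samples, threshold=0.1)
loose = split_segments(samples, threshold=10.0)
```

A tight threshold separates the straight part from the arc, while a loose one keeps the whole curve as a single piece, which illustrates why the choice of threshold matters for a stable segmentation.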

3.2 Extracting Bending Segments 

After iso-bar curves are segmented, high-bending segments can be extracted by 
thresholding. One problem arises when the bending is sharp: the segmentation 
Fig. 4. Segmentation of two sample iso-bar curves.



algorithm extracts a short curve segment that has a large bending since the bend- 
ing takes place in a very small interval; it is difficult for a short curve segment to 
establish spatial relations with other curve segments. However, the sharp bend- 
ing is the main feature we want to detect. To facilitate the detection, we perform 
a branch extension operation on the extracted segments. The branch extension 
operation extends a short curve segment of large bending to its neighbors if its 
neighbor segments are flat. Fig. 5 gives an illustration of this operation.




Fig. 5. The Branch Extension operation. Left: a curve and its segmentation; 
Center: segment extraction without branch extension; Right: segment extraction 
with branch extension. 



4 Experimental Results 

Fig. 6 shows a complete run of the trough extraction algorithm on a 500mb
pressure data set. Each subgraph is labeled with the corresponding operation
and the results obtained. Several things are worth noting:

1. (b) and (d) may look similar, but (d) does not have the short strong adja- 
cency edges. 

2. Because of the branch extension operation, the segments can overlap with
each other. So in (f), a square is used to denote the beginning of a
segment and a circle the end of a segment.

Our algorithm detects one ridge (the longest one) and two troughs as expected. 

Fig. 7 compares a high altitude trough detected by the algorithm with a trough
drawn by meteorologists from real datasets. Fig. 7(b) is the national weather
Fig. 6. A complete run of the trough extraction algorithm. Recoverable panel
titles: (a) Pre-processing: iso-points; (c) Classify (I): strong adjacencies;
(f) Filter: high-bending segments; (g) Adjacency-aggregate (II): N-relations;
(h) Classify (II): strong adjacencies; (i) Re-describe (II) and post-processing:
final results.
Fig. 7. Labeling weather chart: (a) The high-altitude trough (dashed line) detected
by our algorithm. (b) The corresponding trough (dashed line) drawn by
meteorologists for the national weather forecast map for roughly the same area
as the box in (a).



forecast map (footnote 1) for Friday, Jan. 15, 1999 (exact time unknown). The high altitude
trough is shown as the dashed line. Our data is the 250mb pressure forecast data
computed by the ETA model for GMT 6am on Jan. 15, 1999 (footnote 2). The plot region in
Fig. 7(a) is from Latitude 20°N to 59°N and Longitude 145°W to 65°W. Our
algorithm detects a trough and a ridge. The troughs in Fig. 7(a) and (b)
are roughly at the same position. The trough drawn by the meteorologists is
visually more pleasing because it is manually trimmed. In fact, the exact shape
and position are not very important for a synoptic map at this scale.

Currently, the algorithm works well for high altitude pressure datasets but 
may miss some surface troughs. The reason is that we use predetermined thresh- 
olds, including the thresholds to determine when a curve segment is a high- 
bending segment, how far to extend a curve segment in the branch extension 
operation and when the bending directions of two curve segments are considered 
“similar” in our algorithm. Unlike the threshold in curve segmentation, these 
thresholds currently are not adaptively tuned by the underlying data. Finding 
appropriate thresholds according to the inherent properties of the underlying 
data is a very interesting and hard problem. Our iterative thresholding tech- 
nique is an effort toward solving it. We will have to explore this problem more 
in our future work. 



5 Discussion and Related Work 

Spatial data mining is an active research field that leverages recent advances in 
information processing and storage technology. The goal of spatial data mining is 
to extract implicit knowledge, patterns, and relations from spatial databases. Our 
approach emulates the way human experts perceive structures in large datasets 

1 URL: http://www.weathersite.com/enlargetomorrow.html
2 URL: ftp://nic.fb4.noaa.gov/pub/erl/eta.00z
in order for machines to “see” the same structures. It uses existing domain knowledge,
expressed as parameters at various levels of aggregation, to aid the perception
and understanding of spatial datasets. The two approaches are nonetheless
complementary and can benefit from each other. Recent spatial data mining 
research also studies the aggregation techniques and spatial relations. For in- 
stance, Ng and Han described a very efficient algorithm, CLARANS, to find 
spatial clusters of points in large databases. Knorr and Ng presented the CRH 
algorithm for building proximity relations among aggregated spatial structures 
represented as point clusters and features represented as polygons. 

This work is largely inspired by the pioneering work of Ken Yip in interpreting
large fluid datasets. It also shares a similar objective with the visualization
work at Rutgers University. The Rutgers group studies the visiometrics
process that identifies, classifies, and tracks observable features (i.e., spatial ob- 
jects in the terminology of this paper) in fluid datasets. The spatial objects 
studied by the Rutgers group are mostly iso-clusters, and classified by shape 
parameters such as area, curvature, torsion and moments. SA examines multiple 
levels of structural aggregation, whereas the visiometrics process corresponds to 
one layer of spatial aggregation. The Rutgers work introduces a rich vocabulary 
for spatial properties of fluid objects and a temporal object tracking mechanism 
that SA can build upon. 

The importance of structural information in object recognition has long been
recognized by computer vision researchers. Ullman described the well-known
scrambled face example to show that individual features represent nothing
meaningful unless they are properly arranged in space. Structural description
is one of the main approaches to object recognition in computer vision but
is nonetheless under-researched because of the difficulties in computing
structural information and representing the information suitably to facilitate
robust and efficient matching. Our approach makes a significant contribution to
the computation of structural information by providing a systematic way of
building neighborhood relations upon complex aggregated spatial objects.

6 Summary 

We have presented a structure aggregation approach to extracting abstract spa- 
tial objects from spatial datasets. This work builds on SA to hierarchically derive 
neighborhood relations at different abstraction levels. Its application to trough 
extraction shows that it is capable of extracting visually salient and mathemat- 
ically ill-defined spatial objects with relatively simple structures. 

The work described here focuses on the extraction of spatial objects. Future
work will study the identification of spatial objects, which requires a suitable
representation of structural information and an efficient and robust mechanism
to match extracted structures against standard templates. We will continue to
use weather data analysis as the domain in which to explore these research issues.
Abstract. In this paper, we deal with the degradation detection problem
for telecommunication network gateways. The time series to be monitored
is non-stationary but almost periodic (pseudo-periodic). The authors
propose a technique called “optimization of partition model” which
generates local stationary models for a pseudo-periodic time series. The
optimization is based on the minimal AIC principle. The technique called
SPRT is also applied to make efficient decisions. Experiments to evaluate
methods for optimization, incremental model update, and the comparison
with the conventional method are conducted with real data. The results
show that the proposed method is effective and makes more precise
decisions than the conventional one.



1 Introduction 

In a telecommunication network, multiple gateways work simultaneously to es- 
tablish communication paths to a designated foreign network. As there exist 
many constraints in obtaining the detailed status of foreign networks, allocation 
of communication calls to gateways is decided by observing the rate of established 
calls on the gateways. When the rate of established calls significantly goes down 
on a certain gateway, the number of calls to be allocated to the gateway should 
be reduced. A major task of gateway monitoring is to find the degradation of
the rate of established calls for each foreign network.

In modern networks, there are many kinds of communication traffic at one 
time, and the behavior of the traffic is not a simple stochastic process. Therefore 
many heuristic approaches are applied. Monitoring methods using expertise work 
well for some specific gateways. However, the rapid change of traffic behavior 
caused by new services spurs continual efforts to update heuristics in monitoring
systems. 

As a result, there is a strong emphasis on intelligent data analysis techniques
that can deal with real-time data streams from on-line monitors of gateways.

A method for this degradation detection problem must handle real-time data
streams, which form a non-stationary time series, and must continuously update
model parameters. The basic ideas for meeting these requirements are as
follows:



D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 123-^2H 1999. 
(c) Springer-Verlag Berlin Heidelberg 1999
— Conventional methods deal with the rate of established calls directly. The
definition of this rate is the ratio between the number of calls connected 
successfully and the number of attempted calls. We may regard these two 
values as the number of success occurrences and that of trials in Bernoulli 
trials respectively. The authors propose use of a binomial distribution model 
with these two values. 

— The entire time series is non-stationary, but when we divide a time series into
segments of proper size, each segment can be regarded as stationary. Akaike’s
Information Criterion (AIC) is useful for placing partitions at proper
positions in the time series. The authors propose to apply this technique to the
degradation detection problem.

— Over a short range such as several weeks, the rate of established calls
depends approximately on the day of the week: the rate is almost periodic,
and the length of a cycle is one week. The authors call this phenomenon
pseudo-periodic, and call the length of a cycle a pseudo-period. Within such a
short range, we can regard the segments of the time series that occupy the same
position in their cycles as the same stochastic process. The authors
propose to deal with such segments in the same model. By means of this
approximation we can use more samples in each segment.

— The Sequential Probability Ratio Test (SPRT) developed by Wald is suited
for real-time decisions. It is popular in aerospace applications, but has
not been used in this domain. The authors propose to apply this efficient
stochastic test to our degradation detection problem.

2 Optimization of Partition Model for Pseudo-Periodic 
Time Series 

2.1 Definition of Partition Model 

Observations are repeated with an equal time interval. Let

$$S = (n_1, r_1), (n_2, r_2), \ldots, (n_M, r_M), (n_{M+1}, r_{M+1}), (n_{M+2}, r_{M+2}), \ldots, (n_{mM}, r_{mM})$$

be a discrete time series, where $(n_t, r_t)$ denotes the pair that consists of
the number of trials and that of success occurrences at time $t$, $M$ is the
pseudo-period, and $m$ is the number of pseudo-periodic cycles.

Let $P^i = \{\pi^i_1, \pi^i_2, \ldots, \pi^i_{s(i)}\}$ be a set of partitions
(called here a partition model), where $i$ is a unique identifier of the model
and $s(i)$ is the number of partitions in $P^i$.

Let $S^i_1, S^i_2, \ldots, S^i_{s(i)+1}$ be the sets of observed data divided by
$P^i$, and let $F^i_j$ ($0 < F^i_j < M$) be the position of partition $\pi^i_j$,
where $S^i_j$ is given as follows.

$$\begin{aligned}
S^i_j = \{ & (n_{F^i_{j-1}}, r_{F^i_{j-1}}), \ldots, (n_{F^i_j - 1}, r_{F^i_j - 1}), \\
& (n_{M+F^i_{j-1}}, r_{M+F^i_{j-1}}), \ldots, (n_{M+F^i_j - 1}, r_{M+F^i_j - 1}), \\
& \ldots, \\
& (n_{(m-1)M+F^i_{j-1}}, r_{(m-1)M+F^i_{j-1}}), \ldots, (n_{(m-1)M+F^i_j - 1}, r_{(m-1)M+F^i_j - 1}) \}
\end{aligned}$$

$S^i_j$ denotes the set of observed data which exists between $\pi^i_{j-1}$ and
$\pi^i_j$. Note that $S$ is non-stationary but the distribution of $S^i_j$ is
binomial under our assumption.



2.2 Partition Model Selection Based on Minimal AIC Principle 

Let $P(r \mid n, p)$ be the probability that $r$ successes occur in $n$ trials,
each of which has probability $p$ of success.

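The log-likelihood of $S^i_j$ is

$$\begin{aligned}
ll(S^i_j \mid p) &= \sum_{(n,r) \in S^i_j} \log P(r \mid n, p) \\
&= \sum_{(n,r) \in S^i_j} \log {}_nC_r \, p^r (1-p)^{n-r} \\
&= \sum_{(n,r) \in S^i_j} \log {}_nC_r + R^i_j \log p + (N^i_j - R^i_j) \log(1 - p),
\end{aligned}$$

where $N^i_j = \sum_{(n,r) \in S^i_j} n$ and $R^i_j = \sum_{(n,r) \in S^i_j} r$.

When $\hat{p}^i_j$ is the maximum likelihood estimate of $p$ for $S^i_j$,
$\frac{\partial}{\partial p} ll(S^i_j \mid \hat{p}^i_j) = 0$. Thus
$\hat{p}^i_j = R^i_j / N^i_j$. The maximum log-likelihood of $P^i$ is

$$\begin{aligned}
MLL(P^i) &= \sum_{j=1}^{s(i)+1} ll(S^i_j \mid \hat{p}^i_j) \\
&= \sum_{j=1}^{s(i)+1} \left\{ \sum_{(n,r) \in S^i_j} \log {}_nC_r + R^i_j \log \frac{R^i_j}{N^i_j} + (N^i_j - R^i_j) \log \frac{N^i_j - R^i_j}{N^i_j} \right\} \\
&= \sum_{t=1}^{mM} \log {}_{n_t}C_{r_t} + \sum_{j=1}^{s(i)+1} \left\{ R^i_j \log \frac{R^i_j}{N^i_j} + (N^i_j - R^i_j) \log \frac{N^i_j - R^i_j}{N^i_j} \right\}.
\end{aligned}$$
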

Log-likelihood is inappropriate for comparing models when the number of
parameters differs between models. Akaike’s Information Criterion (AIC),
given in the next equation, is known to be a correct measure for general model
comparison. Let $\theta$, $LL(\theta)$, and $|\theta|$ be the model to be
compared, the log-likelihood of the model $\theta$, and the number of free
parameters in the model, respectively.

$$AIC(\theta) = -2 \times LL(\theta) + 2 \times |\theta|$$

Based on the minimum AIC principle, the model $\theta$ which has the minimum
$AIC(\theta)$ is the best model.

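Note that the term $\sum_{t=1}^{mM} \log {}_{n_t}C_{r_t}$ in $MLL(P^i)$ appears in
all partition models, and we can ignore it. The most probable partition model
$\hat{P}$ in $\{P^1, P^2, \ldots\}$ is obtained by the following equations.

$$MLL'(P^i) = \sum_{j=1}^{s(i)+1} \left\{ R^i_j \log \frac{R^i_j}{N^i_j} + (N^i_j - R^i_j) \log \frac{N^i_j - R^i_j}{N^i_j} \right\}$$

$$AIC'(P^i) = -2 \times MLL'(P^i) + 2 \times \{s(i) + 1\}$$

$$\hat{P} = \arg\min_{P^i} AIC'(P^i)$$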
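
The selection rule of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: positions inside a cycle are 0-based, a partition model is represented by a tuple of cut positions, and the function names and toy data are hypothetical.

```python
import math


def xlogy(x, y):
    """Return x * log(y), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def mll_prime(series, period, cuts):
    """MLL' of a partition model (the binomial coefficient term is dropped).

    series : list of (n, r) pairs, one per observation, length m * period
    period : pseudo-period M, in observations per cycle
    cuts   : sorted partition positions F with 0 < F < period (0-based here)
    """
    bounds = [0] + list(cuts) + [period]
    total = 0.0
    for start, end in zip(bounds, bounds[1:]):
        # Pool all observations whose position in their cycle lies in [start, end).
        pooled = [(n, r) for t, (n, r) in enumerate(series)
                  if start <= t % period < end]
        big_n = sum(n for n, _ in pooled)
        big_r = sum(r for _, r in pooled)
        if big_n == 0:
            continue  # a segment without trials contributes nothing
        total += (xlogy(big_r, big_r / big_n)
                  + xlogy(big_n - big_r, (big_n - big_r) / big_n))
    return total


def aic_prime(series, period, cuts):
    """AIC' = -2 * MLL' + 2 * (number of segments)."""
    return -2.0 * mll_prime(series, period, cuts) + 2.0 * (len(cuts) + 1)


def select_partition_model(series, period, candidates):
    """Return the candidate with the minimum AIC'."""
    return min(candidates, key=lambda cuts: aic_prime(series, period, cuts))


# Hypothetical toy data: period 4, three cycles.
# Positions 0-1 connect about 90% of calls, positions 2-3 about 50%.
toy_series = [(100, 90), (100, 90), (100, 50), (100, 50)] * 3
toy_candidates = [(), (2,), (1, 2, 3)]
best = select_partition_model(toy_series, 4, toy_candidates)
```

In the toy data the single cut at position 2 is selected: the finest model reaches the same likelihood but pays for two additional parameters, and the model without cuts fits much worse.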



3 Degradation Detection by SPRT 

The conventional scheme of SPRT (Sequential Probability Ratio Test) is ex- 
plained first and application of SPRT to our degradation detection problem is 
described. 



3.1 Scheme of SPRT 

SPRT is a well-known method for testing a hypothesis against an alternative
one. Let $v_1, v_2, \ldots$ denote the recent successive samples in a given time
series vector. The basis for the SPRT is the recursive calculation of the
logarithm of the likelihood ratio (LLR) function of the normal model and an
alternative model with the recent $q$ samples

$$LLR(q) = \log \frac{P_q(v_1, v_2, \ldots, v_q \mid H_1)}{P_q(v_1, v_2, \ldots, v_q \mid H_0)},$$

where $P_q(v_1, \ldots, v_q \mid H_1)$ is the probability density function when
the process is degraded (hypothesis $H_1$ is true), and
$P_q(v_1, \ldots, v_q \mid H_0)$ is the probability density function when the
process is normal ($H_0$ is true). Assuming that $v_i$ and $v_j$ are
independent, the above formula becomes additive and the LLR function is
computed recursively

$$LLR(q) = LLR(q - 1) + \log \frac{P(v_q \mid H_1)}{P(v_q \mid H_0)},$$

where $P(v_q \mid H_1)$ and $P(v_q \mid H_0)$ are the probability density
functions yielding $v_q$ when the process is degraded or normal, respectively.
$LLR(q)$ is compared to two limits (degraded/normal), with a gray range between
them. When it lies in the gray range, no decision is made. These two limits are
derived from the allowable false alarm rate and the allowable missed alarm rate
chosen by users.
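
The recursive scheme above can be illustrated by the following minimal Python sketch. The text does not state how the two limits are computed from the error rates; the sketch assumes Wald's classical approximations, and the function names are hypothetical.

```python
import math


def sprt_limits(false_alarm_rate, missed_alarm_rate):
    """Decision limits for the accumulated LLR (Wald's approximations, assumed)."""
    upper = math.log((1.0 - missed_alarm_rate) / false_alarm_rate)  # decide H1
    lower = math.log(missed_alarm_rate / (1.0 - false_alarm_rate))  # decide H0
    return lower, upper


def sprt(llr_increments, lower, upper):
    """Accumulate per-sample log-likelihood ratios until a limit is reached.

    Returns (decision, q), where q is the number of samples consumed.
    """
    llr = 0.0
    q = 0
    for increment in llr_increments:
        q += 1
        llr += increment  # LLR(q) = LLR(q - 1) + log P(v_q|H1) / P(v_q|H0)
        if llr >= upper:
            return "degraded", q
        if llr <= lower:
            return "normal", q
    return "undecided", q  # still inside the gray range


# Allowable false alarm and missed alarm rates of 1% each (illustrative values).
lower, upper = sprt_limits(0.01, 0.01)
```

With these illustrative rates the limits are symmetric, at plus and minus log 99, so a run of samples that each contribute an LLR of 1.0 is declared degraded after five samples.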
3.2 SPRT for Degradation Detection in Binomial Models



SPRT requires two functions. One function evaluates the probability that the 
process is degraded and the other evaluates the probability that the process 
is normal. In the case of our degradation problem, LLR function is defined as 
below: 

Let $\hat{p}$ be the maximum likelihood estimate for the segment in the most
probable partition model to which the current observation belongs, let
$\epsilon$ ($0 < \epsilon < 1$) be a weighting parameter which is introduced to
control the sensitivity of degradation detection, and let
$Prob(n, r \mid H_1)$ and $Prob(n, r \mid H_0)$ be the probabilities when the
process is degraded and normal, respectively.

$$\begin{aligned}
LLR(n, r) &= \log \frac{Prob(n, r \mid H_1)}{Prob(n, r \mid H_0)} \\
&= \log \frac{P(r \mid n, p_w)}{P(r \mid n, p_g)} \\
&= \log \frac{{}_nC_r \, p_w^r (1 - p_w)^{n-r}}{{}_nC_r \, p_g^r (1 - p_g)^{n-r}} \\
&= r \log \frac{p_w}{p_g} + (n - r) \log \frac{1 - p_w}{1 - p_g},
\end{aligned}$$

where $p_w$ is the expected success probability under the degraded mode, and
$p_g$ is the expected success probability under the normal mode. $p_w$ and
$p_g$ are defined as follows.

$$p_w = \begin{cases} r/n & \text{if } \epsilon\hat{p} > r/n \\ \epsilon\hat{p} & \text{otherwise} \end{cases} \qquad (1)$$

$$p_g = \begin{cases} r/n & \text{if } \hat{p} < r/n \\ \hat{p} & \text{otherwise} \end{cases} \qquad (2)$$

The rough meaning of $Prob(n, r \mid H_1)$ is the probability that the process
is working under the condition that the success probability is lower than
$\epsilon\hat{p}$. The smaller $\epsilon$ is, the smaller the number of
detections becomes. Fig. 1 shows an example of $Prob(n, r \mid H_1)$ and
$Prob(n, r \mid H_0)$.

By the above definitions, we can apply SPRT to our problem and make an 
efficient decision when new observed data is obtained. 
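
As an illustration, the LLR of a single observation can be computed as in the following Python sketch, which follows Eqs. (1) and (2). The function names are hypothetical, and the sketch assumes $n > 0$ and $0 < \hat{p} < 1$; terms of the form 0 log 0 are treated as zero.

```python
import math


def xlogy(x, y):
    """Return x * log(y), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def binomial_llr(n, r, p_hat, eps):
    """LLR(n, r) for r successes in n trials (assumes n > 0, 0 < p_hat < 1)."""
    rate = r / n
    p_w = rate if eps * p_hat > rate else eps * p_hat  # Eq. (1), degraded mode
    p_g = rate if p_hat < rate else p_hat              # Eq. (2), normal mode
    # r*log(p_w/p_g) + (n-r)*log((1-p_w)/(1-p_g)); the binomial coefficient cancels.
    return (xlogy(r, p_w) - xlogy(r, p_g)
            + xlogy(n - r, 1.0 - p_w) - xlogy(n - r, 1.0 - p_g))


# Same setting as the example figure: n = 50, p_hat = 0.7, eps = 0.6.
llr_normal = binomial_llr(50, 35, 0.7, 0.6)    # observed rate equals p_hat
llr_degraded = binomial_llr(50, 15, 0.7, 0.6)  # observed rate below eps * p_hat
```

An observation at the expected rate gives a negative LLR, which moves the accumulated value toward the normal limit, whereas an observation well below $\epsilon\hat{p}$ gives a positive one.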



4 Empirical Results with Real Traffic Data 

The experiments in this section use traffic data for international telephone
and ISDN calls handled by KDD. The traffic data were acquired by on-line
monitors attached to the gateways during two periods in 1994, each 5 weeks
long. The first period was from Jan. 23 to Feb. 26, and the second was from
Mar. 27 to Apr. 30. One national holiday is included in each period. In the
experiments in this paper, the first period is used as training data and the
second as test data. The time interval of data acquisition is 5 minutes. Thus
the time series of each target foreign network has 9,792 samples for training
and for testing when the data of the national holiday are excluded (34 days
with 288 five-minute samples each). The number of target foreign networks was
1,516 in each period.

Fig. 1. The X-axis shows the number of success occurrences $r$, and the Y-axis
shows $Prob(n, r \mid H_1)$ and $Prob(n, r \mid H_0)$ when $n = 50$,
$\hat{p} = 0.7$, $\epsilon = 0.6$. Probabilities for Bernoulli trials with a
success probability $p = \hat{p}, \epsilon\hat{p}, 0.1, 0.2, \ldots$ are also
indicated.

Partition models may be generated by exhaustive enumeration. But such a blind
procedure often causes a computational explosion and produces infeasible
partition models. Fig. 2 shows one example of typical traffic behavior for each
day of the week. In this example, the behavior on weekdays is much different
from that on the weekend, and Monday is slightly different from the other
weekdays. Domain experts know such phenomena well, and prepared feasible
partition models. Every segment in the prepared models starts at the beginning
of a day and finishes at the end of a day. The number of prepared partition
models (shown later) becomes 11 when national holidays are not taken into account.

Execution of the optimization of partition model, incremental model update, 
and the degradation detection by SPRT is sufficiently fast in the following ex- 
periments, and the proposed method is able to satisfy real-time requirements. 



4.1 Evaluation of Optimization of Partition Model 

At first, all partition models for all target networks are generated from the train- 
ing data. Then every model is applied to the test data, and log-likelihood of each 
model is measured for evaluating the fitness of the model. In this test, models 
are fixed and not updated. 
[Figure 2: one scatter plot per day of the week; legible panel labels: Mon p = 0.690, Tue p = 0.708.]

Fig. 2. Scatter graphs of the observed time series for each day of the
week. The X-axis shows the number of attempted calls, and the Y-axis shows the
number of calls connected successfully. The regression line and its gradient
value are indicated.
Table 1. Rows are listed in increasing order of AIC. Thus the best model is
on the top row, and the worst is on the bottom. In the partition model field,
“M”, “T”, ... denote Monday, Tuesday, ... respectively. Locations of partitions
are indicated by |. For example, the best model consists of four segments:
Monday to Thursday, Friday, Saturday, and Sunday.
Table 1 is a complete example of the eleven partition models for a specific
foreign network. In the table, the log-likelihood of the model with seven
segments is the maximum on the training data, but not on the test data. On the
other hand, the best model selected by the minimal AIC principle has the largest
log-likelihood on the test data. The order of AIC on the training data is
similar to the order of log-likelihood on the test data, and models having
similar partition structures have similar AIC values. This shows that the minimal
AIC principle works well in this example. The differences in log-likelihood
values in the test phase may be small, and the authors do not provide a
theoretical treatment of the statistical significance of these differences.
However, 9,792 observed samples are used to calculate each log-likelihood value,
so the reliability of the obtained log-likelihood values is high. Moreover, other
experiments conducted with reduced samples showed the same tendency.

In general, the best model in the training data is not always the top in the 
rank of log-likelihood in the test data. Fig. 3 shows the ranks of all the best
models in the test data. As the curve of accumulated probability is convex, 
we can conclude the optimization of the partition model based on the authors’ 
proposal is effective for a given pseudo-periodic time series. 

4.2 Evaluation of Incremental Model Update 

In this section, two methods are compared. The first method retains the same 
model selected in the training. The second one re-selects the best partition model 
and updates its parameters at the beginning of the day. Initial models generated 
from the training data are the same in both methods.
Fig. 3. The X-axis shows the rank of the selected models by log-likelihood on
the test data. The Y-axis shows the accumulated probability.

Table 2. Sum of log-likelihood over all foreign networks in the test data.

Type of method                 Sum of log-likelihood
Fixed model                    -7,857,152.0
Model updated incrementally    -7,852,232.9



Table 2 shows the result when the two methods are applied to the test data. The
log-likelihood of the incremental method is larger than that of the method using
fixed models. As a sufficient number of samples is used, the method proposed for
the incremental model update is believed to be effective in practice.

4.3 Utility Comparison with a Conventional Method 

In the conventional method used here, the rate of established calls is compared 
to the threshold at each observation. The threshold is basically decided by the 
distribution of the observed rate, which is assumed to be a Gaussian distribution, 
and fine tuning of the threshold is carried out by network operators continuously. 

The comparison between the conventional method and the proposed one is
made using the following two measures. One measure is the number of detections.
It can be controlled by the sensitivity parameter $\epsilon$ in the proposed
method, but is fixed in the conventional one. The other measure is Total Loss of
Chance. Let $T$ be the set of times at which degradation is detected, and let
$p_t$, $n_t$, and $r_t$ be the expected success probability at time $t$, the
number of trials, and the number of success occurrences, respectively. Total
Loss of Chance is defined as below.

$$\text{Total Loss of Chance} = \sum_{t \in T} (n_t p_t - r_t)$$

Total Loss of Chance is also controllable by $\epsilon$ in the proposed method,
but is fixed in the conventional one.
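
As a small worked example of this measure, the following Python sketch sums the expected number of connected calls minus the observed number over the detection times. The function name and the two sample detections are hypothetical.

```python
def total_loss_of_chance(detections):
    """Sum n_t * p_t - r_t over the detection times.

    detections : iterable of (n_t, p_t, r_t) triples, one per time t in T
    """
    return sum(n * p - r for n, p, r in detections)


# Two hypothetical detections: 100 attempts with 50 connected (expected rate 0.7),
# and 50 attempts with 20 connected (expected rate 0.6).
example_loss = total_loss_of_chance([(100, 0.7, 50), (50, 0.6, 20)])
```

Here the first detection accounts for 20 lost calls and the second for 10, so the total is 30 (up to floating-point rounding).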
[Figure 4: two panels, (a) and (b), each with x-axis "epsilon" from 0.0 to 1.0.]

Fig. 4. (a) Relation between the sensitivity control parameter $\epsilon$ and the
number of degradation detections. (b) Relation between the sensitivity control
parameter $\epsilon$ and Total Loss of Chance.



Fig. 4 shows the results of the two measures in the test data. In Fig. 4(a), $\epsilon$
is 0.72 when the number of detections of each method is the same. In Fig. 4(b),
when e is 0.72, Total Loss of Chance of the proposed method is about 4.6 x 10®, 
which is 1.6 X 10® larger than that of the conventional one. These results show the 
proposed method can make more precise decisions than the conventional method.

5 Related Works 

Takanami and Kitagawa show an application which generates proper models,
called locally stationary AR models, from a non-stationary time series based on
the minimal AIC principle. The proposed method also uses the same principle,
but is extended to a pseudo-periodic time series. In our method, not only
adjacent samples but also samples separated by a pseudo-period are in the same
stochastic model.

Long sequences of successive normal states are often observed in high quality
networks. Such sequences are harmful for SPRT in the sense that the
accumulation of LLR becomes too large. In order to solve this problem, Chien
and Adams propose that LLR be reset to zero when the accumulation reaches
either of the limits. Uosaki proposes another solution, a backward evaluation
of LLR that continues backward until a decision is obtained. We adopted Chien’s
method because Uosaki’s method requires much memory to keep the most recent
log of LLR.
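
The reset rule adopted here can be sketched as follows. This Python fragment is an illustration only: the limits and the LLR increments are hypothetical values, and the function name is not taken from the cited work.

```python
def monitor_with_reset(llr_increments, lower, upper):
    """Run SPRT continuously, resetting the LLR to zero after every decision.

    Yields (q, decision) each time the accumulated LLR reaches a limit,
    where q is the 1-based index of the sample that triggered the decision.
    """
    llr = 0.0
    for q, increment in enumerate(llr_increments, start=1):
        llr += increment
        if llr >= upper:
            yield q, "degraded"
            llr = 0.0  # reset after reaching the upper limit
        elif llr <= lower:
            yield q, "normal"
            llr = 0.0  # reset after reaching the lower limit


decisions = list(monitor_with_reset([2, 2, 2, -3, -3, 1], lower=-4.0, upper=4.0))
```

Because the accumulated value restarts from zero after each decision, a long run of normal observations cannot push the LLR arbitrarily far below the lower limit, and only the current value has to be kept in memory.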
6 Conclusion

In this paper, the degradation detection problem for the rate of established
calls on the gateway is dealt with. The time series to be monitored is
non-stationary but almost periodic (pseudo-periodic). Methods for resolving
this problem require the continuous handling of pseudo-periodic time series and
updating of model parameters. To meet these requirements, the authors propose
the following:

1. Binomial models using the numbers of successfully connected calls and
attempted calls instead of the rate of established calls.

2. Applying the technique generating local stationary models to this problem. 
The technique generates candidates for division by time series, and selects 
the best division among candidates for division based on the minimal AIC 
principle. 

3. Extending the above technique for a pseudo-periodic time series, in which 
non-adjacent local stationary segments, whose distance is a pseudo-period, 
are regarded as the same stochastic process. 

4. Applying a technique called Sequential Probability Ratio Test (SPRT) to 
this problem in order to perform efficient degradation detection. 

In section 2, we showed the optimization of a partition model for this problem
based on proposals 1, 2, and 3 above. The optimization and updating of model
parameters can be executed efficiently using the equations derived at the end
of that section.

In section 3, we showed how to apply the technique called SPRT. This technique
requires the two probabilities that the process is degraded or normal.
Equations to calculate these probabilities are described.

Evaluation of the proposed method for real data is described in section 4.
A complete example of partition models is shown. In this example, the selected 
model based on the minimal AIC principle is also the best model in the test 
data. And the ranks of selected models in the test data are investigated. The 
result shows the proposed method for optimization is effective. The incremental 
model update is also tested. As the log-likelihood of incremental models is better 
than that of fixed models, the effectiveness is proven. Finally we compare the 
proposed method with a conventional method by two measures, the number of 
detections and Total Loss of Chance. The result shows the proposed method 
makes more precise decisions than the conventional one. 

As the proposed method has been shown to be more useful than the conventional
one, it is scheduled to be used this autumn in systems at KDD that monitor
telephone and ISDN services.
Abstract. Human information processing can be monitored by analysing
cognitive evoked potentials (EP) measurable in the electroencephalogram
(EEG) during cognitive activities. In technical terms, both visualization
of high dimensional sequential data and unsupervised discovery of
patterns within this multivariate set of real valued time series are needed.
Our approach towards visualization is to discretize the sequences via
vector quantization and to perform a Sammon mapping of the codebook.
Instead of having to conduct a time-consuming search for common
subsequences in the set of multivariate sequential data, a multiple sequence
alignment procedure can be applied to the set of one-dimensional discrete
time series. The methods are described in detail and results obtained for
spatial and verbal information processing are shown to be statistically
valid, to yield an improvement in terms of noise attenuation and to be
well in line with psychophysiological literature.



1 Introduction 

Psychophysiological studies use the method of cognitive evoked potentials
measurable in the electroencephalogram (EEG) during cognitive activities to
monitor physiological correlates of human information processing. EEG is a non-invasive
method to record electric brain potentials from the human scalp via a set of 
electrodes. We speak of cognitive evoked potentials (EPs) when the EEG is 
recorded from a test subject who is solving a cognitive task during recording. 
An EP is defined as the combination of the brain electric activity that occurs
in association with the eliciting event and ‘noise’, which is brain activity not
related to the event together with interference from non-neural sources. Since
the noise contained in EPs is significantly stronger than the signal, the common
approach is to compute an average across several EPs recorded under equal
conditions to improve the signal-to-noise ratio. The average $\bar{s}(t)$ over
the sample of $N$ EPs is used to estimate the underlying signal $s(t)$:

$$\bar{s}(t) = \frac{1}{N} \sum_{i=1}^{N} x_i(t) = s(t) + \frac{1}{N} \sum_{i=1}^{N} n_i(t), \qquad i = 1, 2, \ldots, N; \quad 0 < t < T \qquad (1)$$

D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 137-|i4^ 1999. 

(c) Springer-Verlag Berlin Heidelberg 1999
where $x_i(t)$ is the $i$th recorded EP, $s(t)$ the underlying signal, $n_i(t)$
the noise associated with the $i$th EP, and $T$ the duration over which each EP
is recorded. The crucial assumption behind averaging is that the evoked signal
$s(t)$ is the same for each recorded EP $x_i(t)$. Whereas this is true for
simpler sensory or motor events, cognitive activities do not elicit one
specific EP waveform time locked to the onset of the recording. Only
subsequences of the whole EPs, which do not occur at a fixed time after the
onset of the recording, can be expected to be due to the cognitive task.
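
The effect of common averaging under its crucial assumption can be illustrated by a small simulation. The following Python sketch is not based on the EP data: it assumes a hypothetical sinusoidal signal that is identical in every trial and independent zero-mean Gaussian noise, in which case averaging $N$ trials reduces the noise amplitude by roughly a factor of $\sqrt{N}$.

```python
import math
import random


def average_trials(trials):
    """Point-wise mean of equally long trials, as in Eq. (1)."""
    count = len(trials)
    return [sum(samples) / count for samples in zip(*trials)]


def rms_error(estimate, reference):
    """Root mean squared difference between two equally long sequences."""
    squared = sum((a - b) ** 2 for a, b in zip(estimate, reference))
    return math.sqrt(squared / len(reference))


rng = random.Random(0)  # fixed seed so the run is repeatable
signal = [math.sin(2 * math.pi * t / 50) for t in range(50)]
# 200 simulated trials: the same signal plus noise five times stronger than it.
trials = [[s + rng.gauss(0.0, 5.0) for s in signal] for _ in range(200)]

single_error = rms_error(trials[0], signal)                 # one noisy trial
averaged_error = rms_error(average_trials(trials), signal)  # after averaging
```

When the signal is not time locked, as argued above for cognitive EPs, this gain is lost because the waveforms no longer add up coherently.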

Our approach towards the analysis of cognitive evoked potentials (EP) combines
several intelligent data analysis methods (see Fig. 1) to tackle these
problems. Since each EP is measured via a number of electrodes, it is a
multidimensional time series. After appropriate filtering, we visualize this
high dimensional sequential set of data by replacing the sequence of the
original vectors by a sequence of prototypical codebook vectors obtained from a
clustering procedure. Additionally, a dimensionality reduction technique is
applied to obtain an ordered one-dimensional representation of the high
dimensional codebook vectors that allows for the depiction of the original
sequence as a one-dimensional time series. Searching for common subsequences in
the vast set of real valued multivariate sequential data is computationally
prohibitive. Instead we can use the set of univariate discrete time series, the
trajectories across codebook vectors, and apply a multiple sequence alignment
procedure for comparison of sequences. Finally, we are able to compute an
alternative selective average across the obtained subsequences.

In particular, the analysis of the temporal structure of cognitive EPs is a
largely unsolved problem in psychophysiology. Classical methods are designed
for univariate time series of simpler motor or sensory EPs only. They usually
assume that the recorded univariate signal is the same for all EPs during the
whole duration of the recording but allow variable latencies of the common
waveform. Therefore they cannot really cope with the harder problem of
analysing cognitive EPs. Existing data mining approaches to the processing of
sequential patterns are not applicable to our problem for the following reasons:
Template based approaches require a query pattern or frequent episode to be
defined before the search is started, which is not possible for cognitive EPs
since only very vague knowledge about the subsequences to be discovered exists.
Template based approaches are also designed for univariate or symbolic sequences
only. The same holds for more specialized approaches, which have the additional
problem of being hard to link to a model of cognitive EPs.

Our work is structured in the following way: First we describe the EP data 
sets and then all applied methods (clustering, visualization, sequence compar- 
ison) are presented in detail. Statistical significance is ensured via comparison 
with results obtained for artificial data sets, the gain in noise attenuation rel- 
ative to common averaging is quantified and the results for spatial and verbal 
information processing are shown to be well in line with literature. All com- 
puter experiments have been done within a rigorous statistical framework using 
appropriate statistical tests. 
Fig. 1. Flow diagram of the spatio-temporal clustering approach.



2 The Data 

The data stem from 10 good and 8 poor female spatializers who were subjected
to both a spatial imagination and a verbal task. The complete data base of EP
recordings is therefore divided into four groups: 319 EPs spatial/good, 167 EPs
spatial/poor, 399 EPs verbal/good, 270 EPs verbal/poor. After appropriate
preprocessing (essentially limiting the signals to frequencies below 8 Hz and
eliminating the DC-like trend by subtracting a linear fit), each EP trial
consists of 2125 samples, each being a 22-dimensional real valued vector. One
complete 22-channel EP trial (duration 8.5 seconds) is depicted in Fig. 2(a).
The discretization step described in Sec. 3 will make the analysis of this vast
data set computationally tractable.

3 Clustering and Visualization 

The EP time series are vector quantized together by using all the EP vectors
at all the sample points as input vectors to a clustering algorithm, disregarding
their ordering in time. Then the sequence of the original vectors $x$ is replaced
by the sequence of the prototypical codebook vectors $\hat{x}$. There is a double
benefit of this step: it is part of the visualization scheme, and the sequences
of $\hat{x}$ serve as input to the sequence comparison procedure.
Fig. 2. (a) Example of a complete 8.5 second 22-channel EP recording. (b) The
corresponding trajectory across codebook vectors depicted as ordered codebook
numbers (y-axis) as a function of time (x-axis).



K-means clustering (see e.g. [p. 201]) is used for vector quantization using
the sum of squared differences $d(x, \hat{x}) = \sum_{i=1}^{k} |x_i - \hat{x}_i|^2$
as measure of distance, where both $x$ and $\hat{x}$ are of dimension $k$. Since
observation of the sum of distances $d(x, \hat{x})$ with growing size of
codebooks did not indicate an optimal codebook size, we pragmatically decided to
use 64 codebook vectors, which is sufficient to preserve all important features
in the sequence of codebook vectors. “Important” features are positive and
negative topographical peaks and their development in time. The high number of
different discrete symbols (64 codebook vectors) did not allow for a more
principled information theoretic approach to obtain an optimal codebook size.
Instead of a set of 22-dimensional time series, we now have sequences of
discrete symbols, where each symbol is drawn from the alphabet of the 64
codebook vectors $\hat{x}$. For the 64 codebook vectors, we calculated a
64 x 64 distance matrix $D_q$.
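
The discretization step can be sketched as follows. This Python fragment is a minimal illustration of K-means vector quantization with the squared-difference distance, not the implementation used for the EP data; the deterministic initialization, the function names, and the two-dimensional toy vectors are assumptions made for the example.

```python
def squared_distance(x, y):
    """Sum of squared differences between two vectors of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def nearest(codebook, x):
    """Index of the codebook vector closest to x."""
    return min(range(len(codebook)), key=lambda c: squared_distance(codebook[c], x))


def kmeans(vectors, k, iterations=20):
    """Plain K-means; starts from k vectors taken at evenly spaced positions."""
    codebook = [list(vectors[(i * len(vectors)) // k]) for i in range(k)]
    for _ in range(iterations):
        members = [[] for _ in range(k)]
        for x in vectors:
            members[nearest(codebook, x)].append(x)
        for c, group in enumerate(members):
            if group:  # keep the old codebook vector if a cluster is empty
                codebook[c] = [sum(col) / len(group) for col in zip(*group)]
    return codebook


def quantize(trial, codebook):
    """Replace every vector of a trial by the number of its codebook vector."""
    return [nearest(codebook, x) for x in trial]


def distance_matrix(codebook):
    """Pairwise squared distances between all codebook vectors."""
    return [[squared_distance(a, b) for b in codebook] for a in codebook]


# Hypothetical toy data: six 2-dimensional vectors forming two clear groups.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_codebook = kmeans(toy, k=2)
toy_symbols = quantize(toy, toy_codebook)
toy_matrix = distance_matrix(toy_codebook)
```

For the EP data the same steps would be run with 22-dimensional vectors and 64 codebook vectors, giving one symbol sequence per trial and a 64 x 64 distance matrix.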

The sequences of codebook vectors can be visualized in a graph where the 
x-axis stands for time and the y-axis for the number of the codebook vector. 
Since in the course of time the trajectory moves only between codebook vectors 
that are close to each other in the 22-dimensional vector space, this neighbourhood 
should also be reflected by an appropriate ordering of the numbers of the 
codebook vectors. Such an ordered numbering results in smooth curves of the 
time vs. codebook number graphs and enables visual inspection of similarities 
between trajectories. We obtain such an ordered numbering by first performing 
a Sammon mapping [?] of the 22-dimensional codebook vectors to one output 
dimension and by then renumbering the codebook vectors according to the 
trivially achieved ordering of their one-dimensional representation. This combined 
technique of K-means clustering plus Sammon mapping of the codebook is 
described in [?] and an application to the analysis of EP data in [5]. An example 
of a trajectory across an ordered set of codebook vectors is given in Fig. 2(b). 
Note that the ordering of the numbers of the codebook vectors is needed only 
for visualization and is not necessary for the subsequent sequence alignment. 
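The renumbering idea can be sketched as follows. Note the substitution: instead of a Sammon mapping, this sketch projects the codebook onto its first principal axis (found by power iteration), which is a simpler linear stand-in for a one-dimensional embedding. All names and the toy codebook are assumptions for illustration.

```python
def principal_axis(vectors, n_iter=100):
    """Mean and first principal axis of the vectors, by power iteration."""
    dim = len(vectors[0])
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    centered = [[v[i] - mean[i] for i in range(dim)] for v in vectors]
    axis = [1.0] * dim
    for _ in range(n_iter):
        proj = [sum(c[i] * axis[i] for i in range(dim)) for c in centered]
        new = [sum(p * c[i] for p, c in zip(proj, centered)) for i in range(dim)]
        norm = sum(a * a for a in new) ** 0.5
        if norm == 0:  # degenerate start vector or identical codebook vectors
            break
        axis = [a / norm for a in new]
    return mean, axis

def ordered_numbering(codebook):
    """Map old codebook numbers to new ones ordered along the 1-D projection."""
    mean, axis = principal_axis(codebook)
    coord = [sum((v[i] - mean[i]) * axis[i] for i in range(len(axis)))
             for v in codebook]
    order = sorted(range(len(codebook)), key=lambda j: coord[j])
    return {old: new for new, old in enumerate(order)}

# Toy codebook whose vectors lie along a line in 2-D.
codebook = [[0.0, 0.0], [10.0, 10.0], [5.0, 5.0], [2.0, 2.0]]
renumber = ordered_numbering(codebook)
# A symbol sequence in old numbers becomes a smooth curve in new numbers.
trajectory = [renumber[s] for s in [0, 3, 2, 1, 2]]
```

With the renumbering applied, codebook vectors that are close in the original space receive neighbouring numbers, which is what makes the time vs. codebook number graph smooth.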

4 Sequence Comparison 

We chose a so-called fixed length subsequence approach for comparison of the 
sequences made of 64 discrete symbols (corresponding to 64 codebook vectors $\hat{x}$). 
Given two sequences E and F of length m, all possible overlapping subsequences 
having a particular window length W from E are compared to all subsequences 
from F. For each pair of elements the score taken from the distance matrix Dq 
is recorded and summed up for the comparison of subsequences. The distance 
between two subsequences of length W from two sequences E and F is therefore: 

$$D_{align}(b_e, b_f, W) = \sum_{i=0}^{W-1} d(E_{b_e+i}, F_{b_f+i}) \qquad (2)$$

The indices $b_e$ and $b_f$ are the beginning points of the subsequences in the 
sequences E and F, and $E_{b_e+i}$ and $F_{b_f+i}$ are the corresponding codebook vectors. 
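A minimal sketch of the pairwise fixed length subsequence comparison, assuming the sequences are lists of codebook numbers and `D` is the codebook distance matrix. The function names and the toy data are hypothetical.

```python
import heapq

def d_align(E, F, be, bf, W, D):
    """Distance between the length-W subsequences starting at be and bf."""
    return sum(D[E[be + i]][F[bf + i]] for i in range(W))

def best_pairs(E, F, W, D, L):
    """The L subsequence pairs (distance, be, bf) with the smallest distance."""
    candidates = ((d_align(E, F, be, bf, W, D), be, bf)
                  for be in range(len(E) - W + 1)
                  for bf in range(len(F) - W + 1))
    return heapq.nsmallest(L, candidates)

# Toy alphabet of three symbols with squared distances between 0, 1 and 2.
D = [[0, 1, 4],
     [1, 0, 1],
     [4, 1, 0]]
E = [0, 0, 1, 2, 2]
F = [2, 1, 2, 2, 0]
pairs = best_pairs(E, F, W=2, D=D, L=2)
```

Every window of E is compared with every window of F, so one pairwise comparison costs on the order of W (m + 1 - W)^2 table lookups for sequences of length m.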

Successive application of this pairwise method allows for the alignment of more 
than two sequences. Such a fixed subsequence approach that is explicitly designed 
for multiple sequence alignment is given by [?]. It computes a multiple alignment 
by iteratively comparing sequences to the multiple alignment obtained so far, 
always keeping just the L best subsequences as an intermediate result. 

This approach to multiple sequence alignment is called progressive align- 
ment. It works by constructing a succession of pairwise alignments. Initially, two 
sequences are chosen at random and aligned via the fixed length subsequence 
approach described above. The L best pairwise alignments (i.e. pairs of subsequences) 
with minimum distances $D_{align}$ are now fixed and stored in a heap. 
Then a third sequence is chosen at random and aligned to all the L pairwise 
alignments. The L best three-way alignments are now fixed and stored in the 
heap. This process is iterated until all sequences have been aligned. 
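The progressive scheme described above can be sketched as follows. This is an illustrative reading of the text, not the original implementation: the random order, the representation of a partial alignment as a tuple of beginning points and all names are assumptions.

```python
import heapq
import random

def d_align(E, F, be, bf, W, D):
    """Distance between the length-W subsequences starting at be and bf."""
    return sum(D[E[be + i]][F[bf + i]] for i in range(W))

def progressive_alignment(seqs, W, D, L, seed=0):
    """Return (score, beginning points) of the best alignment found."""
    rng = random.Random(seed)
    order = list(range(len(seqs)))
    rng.shuffle(order)  # sequences are added in random order
    # Partial alignments are (score, beginning points in 'order').
    partial = [(0.0, (b,)) for b in range(len(seqs[order[0]]) - W + 1)]
    for idx in order[1:]:
        new_seq = seqs[idx]
        candidates = []
        for score, begins in partial:
            for b in range(len(new_seq) - W + 1):
                # Sum of the p pairwise comparisons with the p-way alignment.
                extra = sum(d_align(seqs[o], new_seq, bo, b, W, D)
                            for o, bo in zip(order, begins))
                candidates.append((score + extra, begins + (b,)))
        partial = heapq.nsmallest(L, candidates)  # keep only the L best
    best_score, begins = partial[0]
    result = [None] * len(seqs)
    for o, b in zip(order, begins):
        result[o] = b
    return best_score, result

D = [[0, 1, 4],
     [1, 0, 1],
     [4, 1, 0]]
seqs = [[2, 2, 0, 1, 2, 2],   # common pattern 0, 1, 2 starts at 2
        [0, 1, 2, 0, 0, 0],   # ... starts at 0
        [1, 1, 1, 0, 1, 2]]   # ... starts at 3
score, begins = progressive_alignment(seqs, W=3, D=D, L=5)
```

Keeping only the L best partial alignments is what bounds the cost; a small L risks discarding alignments that are weak in the first sequences but strong later.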

When a subsequence is compared to an intermediate “more”-way (let us say 
p-way) subsequence, the resulting score is computed as the sum of the p pairwise 
comparisons of the subsequences in the intermediate solution with the new 
subsequence that is to be aligned. For N sequences, the number of all such 
crosswise comparisons within the final overall alignment is given by 
$P = \sum_{p=1}^{N-1} p = N(N-1)/2$. The number of all element-wise comparisons 
within the final overall alignment is given by $WP$, and its average per element, 
the average element-wise within alignment distance, by: 

$$\bar{D}_{align} = \frac{1}{WP} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} D_{align}(b_i, b_j, W) \qquad (3)$$

Desired is a set of beginning points $b_i^{min}$ for which $\bar{D}_{align}$ is minimal. The 
$b_i^{min}$ are the same for all d = 22 channels of the corresponding ith EP. To 
diminish the variability of the results of this stochastic algorithm (random usage 
of sequences during iteration), we compute five such alignments and obtain an 
overall alignment and overall beginning points $b_i^{min}$ at the points in time where 
the sums of the five $\bar{D}_{align}$ are minimal. The number of single element-wise 
comparisons to obtain one alignment is $L W (m + 1 - W) P$. For a given L and m, this 
function is proportional to $N^2$, in contrast with $m^N$ comparisons in “brute force” 
searching where not just the L best but all possible alignments are considered. 
As experiments with L equal to 100, 1000 and 10000 showed, it is sufficient to keep 
100 intermediate results to avoid the omission of good alignments that are weak 
in the first few sequences but strong in the later ones. Experiments varying the 
window length W from 31 to 62, 125 and 187 showed that W = 125 (corresponding 
to 500 ms of EP) is short enough to yield alignments of satisfactory quality 
which are still long enough to be significant in terms of their psychophysiological 
interpretation. For more detail on tuning of the parameters L and W see [?]. 

For each channel of EP we can compute an alternative selective average $s'(t)$ 
where the duration T is equal to the length of the subsequences, W, and the 
beginning points of the averaging are the parameters $b_i^{min}$: 

$$s'(t) = \frac{1}{N} \sum_{i=1}^{N} x_i(b_i^{min} + t), \quad 0 \le t < T \qquad (4)$$
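As a small illustration of selective averaging for one channel, the following sketch averages N recordings over a window of length W, each recording starting at its own beginning point. The function name and the toy recordings are assumptions.

```python
def selective_average(eps, begins, W):
    """Average of N single-channel recordings over windows of length W,
    where recording i contributes the samples begins[i] .. begins[i] + W - 1."""
    n = len(eps)
    return [sum(x[b + t] for x, b in zip(eps, begins)) / n for t in range(W)]

# Three toy recordings that contain the pattern 1, 2, 3 at different latencies.
eps = [[0, 1, 2, 3, 0, 0],
       [1, 2, 3, 0, 0, 0],
       [9, 9, 1, 2, 3, 9]]
selective = selective_average(eps, begins=[1, 0, 2], W=3)
# Averaging from a common beginning point 0 smears the pattern.
common = selective_average(eps, begins=[0, 0, 0], W=3)
```

In the toy data the selective average recovers the shared pattern exactly, while the average taken from a common beginning point does not.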



5 Results 



In a related study [?] working on a subset of our EP data we have shown the 
statistical significance of our approach. We verified that our procedure yields 
better results for real human EPs than for unstructured random input in terms 
of average element-wise within alignment distance $\bar{D}_{align}$ (see Equ. 3). We 
compared results obtained from 21 EPs of one test subject with time-shuffled EPs 
and artificial EPs. The latter consisted of random Gaussian sequences whose 
power spectrum was changed appropriately to resemble the characteristics of 
real EPs. A one-way analysis of variance plus additional Duncan t-Tests allowed 
us to rank the result for real EP as being significantly better than the result for 
time-shuffled EP, which is again significantly better than the result for random 
Gaussian EP. 

To compare the gain in noise attenuation of the common average and of our 
selective average, the respective estimated standard deviations of the background 
noise, $\sigma(t)$ and $\sigma'(t)$, are compared: 

$$\sigma(t) = \sqrt{\frac{\sum_{i=1}^{N} [x_i(t) - s(t)]^2}{N - 1}} \qquad (5)$$

$$\sigma'(t) = \sqrt{\frac{\sum_{i=1}^{N} [x_i(b_i^{min} + t) - s'(t)]^2}{N - 1}} \qquad (6)$$

Since $\sigma(t)$ and $\sigma'(t)$ are given for each of the d = 22 channels and for the 
duration of t = m or t = W respectively, the following average estimates of the 
standard deviations of the background noise are computed: 

$$S = \frac{1}{dm} \sum_{j=1}^{d} \sum_{t=0}^{m-1} \sigma_j(t) \qquad (7)$$

$$S' = \frac{1}{dW} \sum_{j=1}^{d} \sum_{t=0}^{W-1} \sigma'_j(t) \qquad (8)$$
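The noise estimates can be sketched as below, assuming the usual sample standard deviation with N - 1 in the denominator at every time point and a plain mean over channels and time points. The function names and toy recordings are hypothetical.

```python
import math

def noise_sd(eps, begins, T):
    """Standard deviation of N single-channel recordings around their average,
    for each of T time points counted from the individual beginning points."""
    n = len(eps)
    avg = [sum(x[b + t] for x, b in zip(eps, begins)) / n for t in range(T)]
    return [math.sqrt(sum((x[b + t] - avg[t]) ** 2
                          for x, b in zip(eps, begins)) / (n - 1))
            for t in range(T)]

def mean_noise(sd_per_channel):
    """Average of the standard deviations over all channels and time points."""
    values = [s for channel in sd_per_channel for s in channel]
    return sum(values) / len(values)

eps = [[0, 1, 2, 3, 0, 0],
       [1, 2, 3, 0, 0, 0],
       [9, 9, 1, 2, 3, 9]]
sigma = noise_sd(eps, [0, 0, 0], 6)        # common averaging, T = m
sigma_sel = noise_sd(eps, [1, 0, 2], 3)    # selective averaging, T = W
S = mean_noise([sigma])                    # one channel only in this toy case
S_sel = mean_noise([sigma_sel])
```

With 22 channels, `mean_noise` would receive one list of standard deviations per channel.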



S is the estimate for common averaging and S' for selective averaging. 
$\sigma_j(t)$ is the $\sigma(t)$ for channel j given by Equ. 5, and $\sigma'_j(t)$ is the $\sigma'(t)$ for 
channel j given by Equ. 6. For all EPs of good and poor spatializers doing the 
spatial imagination task the common average s as well as five selective averages s' 
have been computed. Results for good spatializers were S = 7.68 vs. mean S' = 4.35 ± .068 
and for poor spatializers S = 7.84 vs. mean S' = 4.37 ± .048. Computing Z-values 
shows the differences in noise attenuation to be significant: 

$$Z_{good} = |(4.35 - 7.68)/(.068/\sqrt{5})| = |-109.5| > Z_{99} = 2.58;$$
$$Z_{poor} = |(4.37 - 7.84)/(.048/\sqrt{5})| = |-161.6| > Z_{99} = 2.58.$$



The estimated expected magnitude of the noise residual is now only ≈ 0.56 
times that of the respective common averages. This is a gain in noise attenuation 
of more than 40%. 
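The reported Z-values and the noise ratio can be reproduced from the figures given in the text. The sketch assumes that the standard error is taken over the five selective averages, which is consistent with the reported values; the function name is hypothetical.

```python
import math

def z_value(mean_selective, common, sd, n_runs=5):
    """Absolute Z-value of the difference, with standard error sd / sqrt(n_runs)."""
    return abs((mean_selective - common) / (sd / math.sqrt(n_runs)))

z_good = z_value(4.35, 7.68, 0.068)
z_poor = z_value(4.37, 7.84, 0.048)

# Ratio of the noise residual after selective vs. common averaging.
ratio_good = 4.35 / 7.68
ratio_poor = 4.37 / 7.84
```

Both ratios lie close to 0.56, which corresponds to the stated gain in noise attenuation of more than 40%.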

The results of computing selective averages via beginning points $b_i^{min}$ for 
both good and poor spatializers doing the spatial imagination task are given in 
Fig. 3 as sequences of topographical patterns. Each topography is a spherical 
spline interpolation of the 22 values at a single point in time of the selective 
averaging window. Given are topographies at 40, 80, ..., 440, 480 ms of the window 
for poor spatializers (top two rows) and good spatializers (lower two rows). We 
can see that for both groups there is one specific dominant topographical pattern 
visible, albeit at changing levels of amplitude. It is a pattern of more positive 
amplitudes at frontal to central regions relative to more negative amplitudes at 
occipital to parietal regions. This common topographical pattern is generally 
more negative for poor spatializers. 

Our results obtained via the method of selective averaging have also been 
analysed by a series of analyses of variance (ANOVA). In accordance with the 
procedure of analysis of classical averages, selective averages of EPs are computed 
separately for each test subject and serve as inputs to the ANOVAs. A selective 
average $s'(t)$ is computed separately for a test subject by averaging across all 
corresponding EPs, where the starting points of the averaging are the parameters 
$b_i^{min}$ and the duration is equal to the length of the subsequences, W. 

Besides factors “Task” (spatial vs. verbal), “Performance” (good vs. poor) 
and “Location” (electrode position) we decided to include another factor “Time” 
in our analyses. This factor is needed to describe the variation of the amplitude 
level of the selective averages within the course of time. Six evenly spaced points 
in time within the selective averaging window suffice to allow for a proper analysis 
of this temporal variation. 

Fig. 3. Sequences of topographies for poor spatializers (top two rows) and good 
spatializers (lower two rows). Scale is from -4 to +4 mV. 

The first analysis of variance was computed to test for significance of the 
general differences between spatially and verbally evoked subsequences of 
topographies. The results of this Task (two repeated measures: spatial vs. verbal) x 
Time (6 repeated measures: amplitudes of the EPs at 0, 100, 200, 300, 400, 500 ms 
in the selective averaging window) x Location (22 repeated measures: electrode 
positions) ANOVA are given in Tab. 1. The sample size is N = 16, since only 8 
persons have been subjected to both the spatial and the verbal condition. At the 
chosen significance level of α = 5% all three main effects as well as all two-way 
combined effects, except the Task x Location effect, and the three-way combined 
effect are highly significant. All corresponding p-levels are very small, 
p < 0.001 most of the time, which is still true for the Greenhouse-Geisser 
adjusted probabilities p adj. (the correction factor is given in column ε_GG). 
The significant difference between spatial and verbal task in terms 
of their cortical activity distribution is of course well in line with 
psychophysiological literature. The more negative amplitudes at occipital to parietal 
regions visible in Fig. 3 are as expected and we get the additional information that 
both kinds of information processing are accompanied by a series of activations and 
inactivations. 

Table 1. ANOVA for spatial versus verbal task. N = 16 (8 spatial, 8 verbal). 
Summary of all effects; Design: 1-TASK, 2-TIME, 3-LOCATION. 

Effect   df Effect   df Error   ε_GG    df adj. Effect   df adj. Error   F         p-level   p adj. 
1        1           15         1.000   1.000            15.000          33.853    .000      .000 
2        5           75         .497    2.485            37.280          6.425     .000      .002 
3        21          315        .100    2.109            31.639          8.949     .000      .001 
1x2      5           75         .539    2.695            40.432          7.360     .000      .001 
1x3      21          315        .115    2.418            36.276          1.551     .060      .223 
2x3      105         1575       .031    3.246            48.684          13.917    .000      .000 
1x2x3    105         1575       .047    4.967            74.510          13.777    .000      .000 
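The adjusted degrees of freedom in the ANOVA tables follow from multiplying both degrees of freedom by the Greenhouse-Geisser factor. A minimal sketch, with a hypothetical function name, checked against the TIME effect of the spatial versus verbal analysis:

```python
def gg_adjusted_df(df_effect, df_error, epsilon):
    """Greenhouse-Geisser adjustment: both degrees of freedom are scaled
    by the same correction factor epsilon (0 < epsilon <= 1)."""
    return epsilon * df_effect, epsilon * df_error

# TIME effect in the spatial vs. verbal analysis: df = 5 and 75, epsilon = .497
df_eff, df_err = gg_adjusted_df(5, 75, 0.497)
```

The small deviation from the tabulated error value (37.275 here vs. 37.280 in the table) is explained by the rounding of the correction factor to three decimals. The adjusted probability is then the upper tail of the F distribution with the adjusted degrees of freedom, which is not computed in this sketch.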



Table 2. ANOVA for spatial task, good versus poor performers. N = 18 (10 good, 8 poor). 
Summary of all effects; Design: 1-PERFORMANCE, 2-TIME, 3-LOCATION. 

Effect   df Effect   df Error   ε_GG    df adj. Effect   df adj. Error   F         p-level   p adj. 
1        1           16                                                  7.999     .012 
2        5           80         .502    2.510            40.164          8.369     .000      .000 
3        21          336        .131    2.758            44.122          7.187     .000      .001 
1x2      5           80         .502    2.510            40.164          6.432     .000      .002 
1x3      21          336        .131    2.758            44.122          3.962     .000      .017 
2x3      105         1680       .076    8.015            128.238         10.002    .000      .000 
1x2x3    105         1680       .076    8.015            128.238         18.537    .000      .000 


Taking this difference into account, data of good vs. poor spatial performers 
were analysed separately within task “spatial” and task “verbal”. The results of 
these Performance (good vs. poor) x Time (6 repeated measures: amplitudes of 
the EPs at 0, 100, 200, 300, 400, 500 ms in the selective averaging window) x 
Location (22 repeated measures: electrode positions) ANOVAs are given in Tab. 2 
for spatial data and in Tab. 3 for verbal data. The sample size for the spatial 
data is N = 18, consisting of 10 good and 8 poor spatial performers. At the 
chosen significance level of α = 5% all main and combined effects are significant, 
both before and after the Greenhouse-Geisser adjustment. For the verbal data, 
at the significance level of α = 5% the main effect for the factor “Performance” 
is not significant, whereas most of the combined effects still are. Since the two 
performance levels “good” and “poor” represent extreme groups of spatial ability 
selected by psychological testing, they should be discriminable in their EP 
correlates during the spatial but not during the verbal task. This is exactly what 
we have found and what others [?] have also reported and attributed to a higher 
investment of cortical effort visible as a more negative amplitude level of one 
similar pattern. 



Table 3. ANOVA for verbal task, good versus poor performers. N = 16 (8 good, 8 poor). 
Summary of all effects; Design: 1-PERFORMANCE, 2-TIME, 3-LOCATION. 

Effect   df Effect   df Error   ε_GG    df adj. Effect   df adj. Error   F         p-level   p adj. 
1        1           14                                                  0.440     .518 
2        5           70         .469    2.345            32.832          19.873    .000      .001 
3        21          294        .215    4.520            63.276          105.623   .000      .000 
1x2      5           70         .469    2.345            32.832          2.916     .316      .318 
1x3      21          294        .215    4.520            63.276          11.949    .000      .000 
2x3      105         1470       .072    7.611            106.559         16.102    .000      .000 
1x2x3    105         1470       .072    7.611            106.559         4.045     .000      .001 



6 Discussion 

The analysis of cognitive evoked potentials is a largely unsolved problem in 
psychophysiological research. Classical methods are designed for univariate time 
series of simpler motor or sensory EPs only and therefore cannot really cope 
with the harder problem of analysing cognitive EPs. Nevertheless they are still 
state of the art. 

We have developed a general approach to the visualization of high dimen- 
sional sequential data and the unsupervised discovery of patterns within mul- 
tivariate sets of time series data by combining several intelligent data analysis 
techniques in a novel way. Our method allows the analysis of cognitive evoked 
potentials by finding common multivariate subsequences in a set of EPs which 
have fixed length but variable latencies and are sufficiently similar across all 
EP channels. With this new kind of selective averaging it is possible to better 
analyse the temporal structure of the cognitive processes under investigation. 

We were able to validate our approach both on a statistical basis and in terms 
of the psychophysiological content of the obtained results: we demonstrated 
statistical significance by comparison with results obtained for artificial data and 
by quantifying the gain in noise attenuation; we showed the plausibility of our 
results by comparing them to what is already known about the psychophysiology 
of the human brain. 

Our general approach to the visualization of high dimensional sequential data 
and the unsupervised discovery of patterns within multivariate sets of time series 
data is of course not restricted to the problem presented in this work. The methods 
described can either be applied to multivariate real-valued data by using 
the full approach including the transformation to sequences of discrete symbols 
through vector quantization plus Sammon mapping or, if symbolic sequences 
are already available, the fixed segment algorithm alone can be applied. Our 
approach is also open to using more advanced techniques of sequence comparison, 
such as Hidden Markov Models [?]. 



Abstract. Data mining methods are designed for revealing significant 
relationships and regularities in data collections. Regarding spatially ref- 
erenced data, analysis by means of data mining can be aptly comple- 
mented by visual exploration of the data presented on maps as well as 
by cartographic visualization of results of data mining procedures. We 
propose an integrated environment for exploratory analysis of spatial 
data that equips an analyst with a variety of data mining tools and pro- 
vides the service of automated mapping of source data and data mining 
results. The environment is built on the basis of two existing systems, 
Kepler for data mining and Descartes for automated knowledge-based 
visualization. Importantly, the open architecture of Kepler allows new data 
mining tools to be incorporated, and the knowledge-based architecture 
of Descartes allows appropriate presentation methods to be selected 
automatically according to characteristics of data mining results. The paper 
presents example scenarios of data analysis and describes the architec- 
ture of the integrated system. 



1 Introduction 

The notion of Knowledge Discovery in Databases (KDD) denotes the task of 
revealing significant relationships and regularities in data based on the use of 
algorithms collectively called “data mining”. The KDD process is an iterative 
execution of the following steps [6]: 

1. Data selection and preprocessing, such as checking for errors, removing out- 
liers, handling missing values, and transformation of formats. 

2. Data transformations, for example, discretization of variables or production 
of derived variables. 

3. Selection of a data mining method and adjustment of its parameters. 

4. Data mining, i.e. application of the selected method. 

5. Interpretation and evaluation of the results. 

In this process the phase of data mining takes no more than 20 % of the 
total workload. However, this phase is much better supported methodologically 
and by software than all others [7]. This is not surprising because performing 
these other steps is a matter of art rather than a routine allowing automation 
[8]. Lately some efforts in the KDD field have been directed towards intelligent 
support to the data mining process, in particular, assistance in the selection of 
an analysis method depending on data characteristics [2,4]. 

D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 149-[?], 1999. 
© Springer-Verlag Berlin Heidelberg 1999 

A particular case of KDD is knowledge extraction from spatially referenced 
data, i.e. data referring to geographic objects or locations or parts of a territory 
division. In analysis of such data it is very important to account for the spatial 
component (relative positions, adjacency, distances, directions etc.). However, 
information about spatial relationships is very difficult to represent in the discrete, 
symbolic form required by data mining methods. There is work on spatial 
clustering [5] and on the use of spatial predicates [9], but these approaches are 
characterized by a high complexity of data description and large computational expense. 



2 Integrated Environment for Knowledge Discovery 

For the case of analysis of spatially referenced data we propose to integrate tra- 
ditional data mining instruments with automated cartographic visualization and 
tools for interactive manipulation of graphical displays. The essence of the idea is 
that an analyst can view both source data and results of data mining in the form 
of maps that convey spatial information to a human in a natural way. This offers 
at least a partial solution to the challenges caused by spatially referenced data: 
the analyst can easily see spatial relationships and patterns that are inaccessible 
to a computer, at least at the present stage of development. In addition, 
such integration can significantly support various KDD steps. 

The most evident use of cartographic visualization is in evaluation and in- 
terpretation of data mining results. However, maps can be helpful also in other 
activities. For example, visual analysis of spatial distributions of different data 
components can help in selection of representative variables for data mining 
and, possibly, suggest which derived variables would be useful to produce. At 
the stage of data preprocessing a map presentation can expose strange values 
that may be errors in the data or outliers. Discretization, i.e. transformation of a 
continuous numeric variable into one with a limited number of values by means 
of classification, can be aptly supported by a dynamic map display showing spa- 
tial distribution of the classes. With such a support the analyst can adjust the 
number of classes and class boundaries so that interpretable spatial patterns 
arise. 

More specifically, we propose to build an integrated KDD environment on the 
basis of two existing systems, Kepler [11] for data mining and Descartes [1] for 
interactive visual analysis of spatially referenced data. Kepler includes a number 
of data mining methods and, very importantly, provides a universal plug-in 
interface for adding new methods. In addition, the system contains some tools for 
data and format transformation, access to databases, and querying, and is capable 
of graphical presentation of some kinds of data mining results (trees, rules, and 
groups). 

Descartes¹ automates generation of maps presenting user-selected data and 
supports various interactive manipulations of map displays that can help to 
visually reveal important features of the spatial distribution of data. Descartes 
also supports some data transformations productive for visual analysis, and has 
a convenient graphical interface for outlier removal and an easy-to-use tool for 
generation of derived variables by means of logical queries and arithmetic 
operations over existing variables. It is essential that both systems are designed to 
serve the same goal: to help gain knowledge about data. They offer different 
instruments that can complement each other and together produce a synergistic 
effect. 

Currently, Kepler contains data mining methods for classification, clustering, 
association, rule induction, and subgroup discovery. Most of the methods require 
selection of a target variable and try to reveal relationships between this variable 
and other variables selected for the analysis. The target variable most often 
should be discrete. Descartes can be effectively used for finding “promising” 
discrete variables including, implicitly or explicitly, a spatial component. The 
following ways of doing this are available: 

1. Classification by segmentation of a value range of a numeric variable into 
subintervals. 

2. Cross-classification of a pair of numeric attributes. In both cases the process 
of classification is highly interactive and supported by a map presentation 
of the spatial distribution of the classes that reflects in real time all changes 
in the definition of classes. 

3. Spatial aggregation of objects performed by the user through the map inter- 
face. Results of such an aggregation can be represented by a discrete variable. 
For example, the user can divide city districts into center and periphery or 
encircle several regions, and the system will generate a variable indicating 
to which aggregate each object belongs. 
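The first two ways of producing discrete variables can be sketched in a few lines. The function names, the class boundaries and the country values are hypothetical; in the interactive setting described above, the boundaries would be changed by the user and the map redrawn after each change.

```python
import bisect

def classify(value, breaks):
    """Class index of a numeric value for ascending class boundaries:
    0 below the first boundary, len(breaks) at or above the last one."""
    return bisect.bisect_right(breaks, value)

def cross_classify(a, b, breaks_a, breaks_b):
    """Pair of class indices for a pair of numeric attributes."""
    return classify(a, breaks_a), classify(b, breaks_b)

# Hypothetical attribute values for three objects.
product_per_capita = {"A": 2500, "B": 12000, "C": 31000}
breaks = [5000, 20000]
classes = {name: classify(v, breaks) for name, v in product_per_capita.items()}

# Cross-classification of two attributes with one boundary each.
cell = cross_classify(12.0, 60.0, [9.75], [68.64])
```

Moving a boundary only changes the `breaks` list, so the resulting discrete variable can be recomputed cheaply whenever the user adjusts the classes.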

Results of most of the data mining methods are naturally presentable on 
maps. The most evident is the presentation of subgroups or clusters: coloring 
or an icon can indicate that a geographical object belongs to a subgroup or 
a cluster. The same technique can be applied to tree nodes and rules: visual 
features of an object indicate whether it is included in the class corresponding 
to a selected tree node, or whether a given rule applies to the object and, if so, 
whether it is correctly classified. 

Since Kepler contains its own facilities for non-geographical presentation of 
data mining results, it would be productive to make a dynamic link between 
displays of Kepler and Descartes. This means that, when a cursor is positioned 
on an icon symbolizing a subgroup, a tree node, or a rule in Kepler, the corre- 
sponding objects are highlighted in a map in Descartes. And vice versa, selection 
of a geographical object in a map results in highlighting the subgroups or tree 
nodes including this object or marking rules applicable to it. 
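One simple way to support such two-way highlighting is to keep a mapping from display elements to geographical objects together with its inverse. The sketch below is an assumption about how this could be organized, not a description of the Kepler or Descartes internals; all names and data are hypothetical.

```python
def invert_links(element_to_objects):
    """Build the reverse lookup: geographical object -> display elements."""
    object_to_elements = {}
    for element, objects in element_to_objects.items():
        for obj in objects:
            object_to_elements.setdefault(obj, set()).add(element)
    return object_to_elements

# Hypothetical tree nodes with the geographical objects they contain.
node_members = {
    "node 1": {"France", "Spain"},
    "node 2": {"Spain", "Poland"},
}
by_object = invert_links(node_members)

# Pointing at a node highlights node_members[node] on the map;
# selecting an object on the map highlights by_object[object] in the tree.
highlight_in_tree = by_object["Spain"]
```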



¹ See on-line demos on the Internet at the URL http://allanon.gmd.de/and/java/iris/ 

The considerations presented above can be summarized in the form of three 
kinds of links between data mining and cartographic visualization: 

- From “geography” to “mathematics”: using dynamic maps, the user arrives 
  at some geographically interpretable results or hypotheses and then tries to 
  find an explanation of the results or checks the hypotheses by means of data 
  mining methods. 

- From “mathematics” to “geography”: data mining methods produce results 
  that are then visually analyzed after being presented on maps. 

- Linked displays: graphics representing results of data mining in the usual 
  (non-cartographic) form are viewed in parallel with maps, and dynamic 
  highlighting visually connects corresponding elements in both types of displays. 

3 Scenarios of Integration 

In this section we consider several examples of data exploration sessions where 
interactive cartographic visualization and different traditional methods of data 
mining were productively used together. 



3.1 Analysis with Classification Trees 

In this session the user works with economic and demographic data about 
European countries². He selects the attribute “National product per capita” and 
makes a classification of its values that produces interesting semantic and 
geographic clustering (Fig. 1). Then he asks the system to investigate how the 
classes are related to values of other attributes. The system starts the C4.5 
algorithm and after about 15 seconds of computation produces the classification 
tree (Fig. 2). 

It is important that displays of the map and the tree are linked: 

- pointing to a class in the interactive classification window highlights the tree 
  nodes relevant to this class (i.e. where this class dominates other classes); 

- pointing to a geographical object on the map results in highlighting of the 
  tree nodes representing groups including the object; 

- pointing to a tree node highlights contours of objects on the map that form 
  the group represented by this node (generally, in colors of classes). 



3.2 Analysis with Classification Rules 

In this session the user works with a database about countries of the world³. 
He selects the attribute “Trade balance” with an ordered set of values: “import 
much bigger than export”, “import bigger than export”, “import and export are 
approximately equal”, “export bigger than import”, and “export much bigger 
than import”. He looks at the distribution of values over the world and does not 
find any regularity. Therefore, he asks the system to produce classification rules 
explaining the distribution of values on the basis of other attributes. After a short 
computation by the C4.5 method the user receives a set of rules. Two examples 
of the rules are shown in Fig. 3. 

² The data have been taken from CIA World Book 
³ The data originate from ESRI world database 

Fig. 1. Interactive classification of values of the target attribute 

Fig. 2. The classification tree produced by the C4.5 algorithm 

Fig. 3. Classification rules 

Fig. 4. Visualization of the rule for South America 



For each rule, upon the user’s selection, Descartes can automatically produce a 
map that visualizes the truth values of the left and right parts of the rule for each 
country. In this map it is possible to see which countries are correctly classified by 
the rule (both parts are true), which are misclassified (the premise is true while 
the consequence is false), and which cases remain uncovered (the consequence is 
true while the premise is false). Thus, in the example map in Fig. 4 (representing 
the second rule from Fig. 3) darker circle sectors indicate truth of the premise 
and lighter ones truth of the consequence. One can see here seven cases of 
correct classification marked by signs with both sectors present and two cases of 
non-coverage where the signs have only the lighter sectors. 
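The three cases distinguished on the map follow directly from the truth values of the two parts of a rule. A minimal sketch with hypothetical names; the list of cases is made up so that its counts match the seven correct and two uncovered cases mentioned for the example map.

```python
from collections import Counter

def rule_outcome(premise, consequence):
    """Category of one object with respect to a rule 'premise -> consequence'."""
    if premise and consequence:
        return "correct"
    if premise:
        return "misclassified"
    if consequence:
        return "uncovered"
    return "not affected"

# Hypothetical truth values (premise, consequence) for twelve countries.
cases = [(True, True)] * 7 + [(False, True)] * 2 + [(False, False)] * 3
counts = Counter(rule_outcome(p, c) for p, c in cases)
```

On the map, the premise and the consequence each control one circle sector, so the four categories correspond to the four possible sector combinations.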

The user can continue his analysis with interactive manipulation facilities of 
maps to check the stability of relationships found. Thus, he can try to change 
boundaries of intervals found by the data mining procedure and see whether the 
correspondence between conditions in the left and the right parts of the rule will 
be preserved. 

3.3 Selection of Interesting Subgroups 

In this session the user wants to analyze the distribution of demographic 
attributes over continents. He selects a subset of these attributes and runs the 
SIDOS method to discover interesting subgroups (see some of the results in 
Fig. 5). For example, the group with “Death rate” less than 9.75 and “Life 
expectancy for female” greater than 68.64 includes 51 countries (31 % of the World 
countries), and 40 of them are African countries (78 % of African countries). To 
support the consideration of this group, Descartes builds a map (Fig. 6). The 
map shows all countries satisfying the description of the group. On the map the 
user can see specifically which countries form the group, which of them are in 
Africa and which are on other continents. 

Fig. 5. Descriptions of interesting subgroups 

It is necessary to stress once again that Descartes does the map design 
automatically on the basis of the knowledge base on thematic data mapping. The 
subgroups found give the user some hints for further analysis: which countries 
to select for a closer look; the collection of attributes that best characterizes the 
continents; groups of attributes with interesting spatial co-distribution. Thus, if the 
user selects the pair of attributes cited in the definition of the considered group 
for further analysis, the system automatically creates a map for dynamic 
cross-classification on the basis of these attributes. The user may find other interesting 
threshold value(s) that lead to clear spatial patterns (Fig. 7). 

Fig. 6. Visualization of the subgroup 




Fig. 7. Co-distribution of 2 attributes: “Death rate” and “Life expectancy, 
female”. Red (the darkest) countries are characterized by high death rate and low 
life expectancy. Green (lighter) countries have small death rates and high life 
expectancy. Yellow (light) countries are characterized by high death rate and 
high life expectancy. 



3.4 Association Rules 

In this session the user studies co-occurrence of memberships in various 
international organizations. Some of them have similar spatial distributions. To find a 
numeric estimate of the similarity the user selects the “association rules” method. 
The method produces a set of rules concerning simultaneous membership in 
different organizations. Thus, it is found that 136 countries are members of 
UNESCO, and 128 of them (94 %) are also members of IBRD. This rule is 
supported by visualization of membership on automatically created maps. One 
of them shows members of UNESCO not participating in IBRD (Fig. 8). 

Fig. 8. Countries - members of UNESCO and non-members of IBRD

Generally, this method is applicable to binary (logical) variables. It is
important that Descartes allows the user to produce various logical variables
as results of data analysis. Thus, they can be produced by marking table rows
as satisfying or contradicting some logical or territorial query, by
classifying numeric variables into two classes, etc. The association rules
method is a convenient tool for the analysis of such attributes.
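The figures quoted for the UNESCO/IBRD rule (136 members, 128 of them also in IBRD, i.e. 94 %) are the support and confidence of a rule over two logical attributes. The following is a minimal sketch of that computation, not the Kepler implementation; the membership table and the function names `rule_stats` and `exceptions` are hypothetical and serve only as an illustration.

```python
# Hypothetical membership table: each country maps to the set of
# organizations it belongs to (illustrative data, not from the paper).
memberships = {
    "A": {"UNESCO", "IBRD"},
    "B": {"UNESCO", "IBRD"},
    "C": {"UNESCO"},
    "D": {"IBRD"},
    "E": set(),
}


def rule_stats(rows, antecedent, consequent):
    """Counts and confidence for the rule `antecedent -> consequent`.

    Returns (objects satisfying the antecedent,
             objects satisfying both, confidence).
    """
    n_ante = sum(1 for orgs in rows.values() if antecedent in orgs)
    n_both = sum(
        1 for orgs in rows.values() if antecedent in orgs and consequent in orgs
    )
    confidence = n_both / n_ante if n_ante else 0.0
    return n_ante, n_both, confidence


def exceptions(rows, antecedent, consequent):
    """Objects that satisfy the antecedent but contradict the consequent."""
    return sorted(
        name
        for name, orgs in rows.items()
        if antecedent in orgs and consequent not in orgs
    )


n_ante, n_both, confidence = rule_stats(memberships, "UNESCO", "IBRD")
contradicting = exceptions(memberships, "UNESCO", "IBRD")
```

The list returned by `exceptions` corresponds to the kind of objects a map such as Fig. 8 displays: instances that contradict a discovered rule.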

3.5 Analysis of Sessions 

It is clear that in all the sessions described above interactive visualization and 
data mining act as complementary instruments for data analysis. Their integra- 
tion supported the iterative process of data analysis: 



Interactive maps:
   1) Data preview, initial hypotheses
   2) Classification of values of attributes leading to interesting
      spatial patterns
   3) Definition of regions

Data mining:
   1) Relationships among attributes
   2) Characterization of regions
   3) Attribute-based grouping of spatial objects

Interactive maps:
   1) Relate general descriptions to individual instances
   2) Instances that support or contradict discovered regularities
   3) Spatial distribution of discovered groups



We should stress the importance of knowledge-based map design in all stages 
of the analysis. The ability of Descartes to automatically select presentation 
methods makes it possible for the user to concentrate on problem solving. 

Generally, for the first prototype we selected only high-speed data mining
methods to avoid long waiting times. However, there is currently a trend in
the development of data mining algorithms towards so-called anytime methods,
which provide rough results after short computations and improve them with
longer calculations. The open architecture of Kepler allows such methods to be
added later and linked with the map visualizations of Descartes.

One can note that we applied the system to already aggregated, relatively
small data sets. However, even with these data the integrated approach shows
its advantages. Later we plan to extend the approach to large sets of raw data.
The main problem is that maps are typically used for visualization of data
aggregated over territories. A solution may be found through automated or
interactive aggregation of raw data and of the results of data mining methods.

4 Software Implementation 

The software implementation of the project is facilitated by the fact that
both systems have a client-server architecture and use socket connections and
the TCP/IP protocol for client-server communication. The client components of
both systems are implemented in the Java language and provide the user
interface. To couple the two systems, we implemented an additional link between
the two servers. The Descartes server activates the Kepler server, establishes
a socket connection, and commands Kepler to load the same application
(workspace).

In the current implementation, the link between the two systems can be 
activated only in one direction: working with Descartes, the user can make Kepler 
apply some data mining method to selected data. A list of applicable methods is 
available to the user depending on the context (how many attributes are selected, 
what are their types, etc.). The selection of appropriate data analysis methods 
is done on the basis of an extension to the current visualization knowledge base 
existing in Descartes.

Fig. 9. The architecture of the integrated system



The link to data mining is available both from a table window and from some
types of maps. Thus, classification methods (classification trees and rules)
as well as subgroup discovery methods are available both from a table
containing qualitative attribute(s) and from maps for interactive
classification or cross-classification. The association rules method is
available from a table with several logical attributes or from a map presenting
such attributes.

When the user decides to apply a data mining method, the Descartes client 
allows him to specify the scope of interest (choose a target variable or value 
when necessary, select independent attributes, specify method-specific parame- 
ters, etc.) and then sends this information to the Descartes server. The server 
creates a temporary table with selected data and commands the Kepler server 
to import this table and to start the specified method. After finishing the com- 
putations, the Kepler server passes the results to the Kepler client, and this 
component visualizes the results. At this point a new socket connection between 
the Descartes client and the Kepler client is established for linking of graphics 
components. This link provides simultaneous highlighting of active objects on 
map displays in Descartes and graphic displays in Kepler. 

Results of most data mining methods can be presented by maps created in 
Descartes. For this purpose the Kepler server sends commands to the Descartes 
server to activate map design, and the Descartes client displays the created maps 
on the screen. 

5 Conclusions 

To compare our work with others, we may note that exploratory data analysis
has traditionally been supported by visualizations. Some work has been done on
linking statistical graphics built in the XGobi package with maps displayed in
the ArcView GIS [3] and on connecting clustering dendrograms with maps [10].
However, all previous works we are aware of utilize only a restricted set of
predefined visualizations.

In our work we extend this approach by integrating data mining methods 
with knowledge-based map design. This allows us to create a general mapping 
interface for data mining algorithms. This feature together with the open archi- 
tecture of Kepler gives an opportunity to add new data mining methods without 
system reengineering.

Abstract. The visualization of large text databases and document collections
is an important step towards more flexible and interactive types of
information retrieval. This paper presents a probabilistic approach which
combines a statistical, model-based analysis with a topological visualization
principle. Our method can be utilized to derive topic maps which represent
topical information by characteristic keyword distributions arranged in a
two-dimensional spatial layout. Combined with multi-resolution techniques this
provides a three-dimensional space for interactive information navigation in
large text collections.



1 Introduction 

Despite the great enthusiasm and excitement our time shows for all types of
new media, it is indisputable that the most nuanced and sophisticated medium
to express or communicate our thoughts is what Herder calls the 'vehiculum of
our thoughts and the content of all wisdom and knowledge' [?]: our language.
Consequently, prodigious benefits could result from the enhanced circulation
and propagation of recorded language by today's digital networks, which make
abundant repositories of text documents such as electronic libraries available
to a large public. Yet, the availability of large databases does not
automatically imply easy access to relevant information, since retrieving
information from a glut of nuisance data can be tedious and extremely time
consuming.

What is urgently needed are navigation aids: overviews which offer
uncomplicated and fast visual access to information, and maps that provide
orientation, possibly on different levels of resolution and abstraction. This
paper deals with a statistical approach to provide such overviews and maps for
large collections of text documents. It aims at a concise visualization of
conceptual and topical similarities between documents or aspects of documents
in the form of topic maps. The proposed method has two building blocks:

i. A latent semantic analysis technique for text collections which models
context-dependent word occurrences.

ii. A principle of topology preservation [?] which allows the extracted
information to be visualized, for example, in the form of a two-dimensional
map.



D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 161-[?], 1999.

Herein, data analysis and visualization are not treated as separate procedural
stages; as we will discuss in more detail later on, it is a benefit of our
procedure that it unites both problems. This is formally achieved by optimizing
a single objective function which combines a statistical criterion with
topological constraints to ensure visualization. This coupling makes sense
whenever the final goal is not the analysis per se, but the presentation and
visualization of regularities and patterns extracted from data to a user. As a
general principle, the latter implies that the value of an analysis carried out
by means of a machine learning algorithm depends on whether or not its results
can be represented in a way which makes them amenable to human (visual)
inspection and allows an effortless interpretation. Obviously it can be of
great advantage if this is taken into account as early as possible in the
analysis and not in a post hoc manner.

Our approach is somewhat related in spirit to the WEBSOM learning architecture
[?], which continues earlier work on semantic maps and performs a topological
clustering of words represented as context vectors. However, the method
presented here is based on a strictly probabilistic data model which is fitted
by maximum likelihood estimation. The discrete nature of words is directly
taken into account without a detour via a (randomized) vector space
representation as in the WEBSOM. In addition, our model does not perform word
clustering, but models topics via word distributions.

The rest of the paper is organized as follows: Section 2 briefly introduces a
probabilistic method for latent semantic analysis, which is then extended to
incorporate topological constraints in Section 3. Finally, Section 4 shows some
exemplary results of multi-resolution maps extracted from document collections.

2 Probabilistic Latent Semantic Analysis 

2.1 Data Representation 

Probabilistic Latent Semantic Analysis (PLSA) [?] is a general method for
statistical factor analysis of two-mode and count data which we apply here to
learning from document collections. Formally, text collections are represented
as pairs over a set of documents $\mathcal{D} = \{d_1, \dots, d_N\}$ and a set
of words $\mathcal{W} = \{w_1, \dots, w_M\}$, i.e., the elementary observations
we consider are of the form $(d, w)$, denoting the occurrence of a word $w$ in
a document $d$. Summarizing all observations by counts $n(d, w)$ of how often a
word occurred in a document, one obtains a rectangular $N$ by $M$ matrix
$\mathbf{N} = [n(d_i, w_j)]_{i,j}$ which is usually referred to as the
term-document matrix. The key assumption of this representation is the
so-called 'bag-of-words' view, which presupposes that, conditioned on the
identity of a particular document, word occurrences are statistically
independent. This is also the basis for the popular vector-space model of
documents [?], and it is known that $\mathbf{N}$ will in many cases preserve
most of the relevant information, e.g., for tasks like text retrieval based on
keywords, which makes it a reasonable starting point for our purposes.
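As an illustration of this representation, the sketch below builds the count matrix $n(d_i, w_j)$ for a three-document toy corpus and measures the fraction of zero entries. The corpus, the whitespace tokenization, and the function name `term_document_matrix` are assumptions made for the example; the preprocessing actually used in the experiments (stop word list, frequency cut-off) is not reproduced here.

```python
from collections import Counter

# Toy corpus (illustrative only, not taken from the paper's collections).
documents = [
    "image segment region image",
    "speech segment phoneme",
    "video motion segment video",
]


def term_document_matrix(docs):
    """Return (vocabulary, counts) with counts[i][j] = n(d_i, w_j)."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    index = {word: j for j, word in enumerate(vocabulary)}
    counts = [[0] * len(vocabulary) for _ in docs]
    for i, tokens in enumerate(tokenized):
        # Bag-of-words view: only the counts matter, not the word order.
        for word, n in Counter(tokens).items():
            counts[i][index[word]] = n
    return vocabulary, counts


vocabulary, counts = term_document_matrix(documents)
nonzero = sum(1 for row in counts for n in row if n > 0)
sparsity = 1.0 - nonzero / (len(counts) * len(vocabulary))
```

Even in this tiny example more than half of the matrix entries are zero; for realistic vocabularies the proportion is far higher.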

The term-document matrix immediately reveals the problem of data sparseness,
which is one of the problems latent semantic analysis aims to address. A
typical matrix derived from short texts like news stories, book summaries or
paper abstracts may only have a tiny fraction of non-zero entries, because just
a small part of the vocabulary is typically used in a single document. This has
consequences, in particular for methods that evaluate similarities between
documents by comparing or counting common terms. The main goal of PLSA in this
context is to map documents and words to a more suitable representation in a
probabilistic latent semantic space. As the name suggests, the representation
of documents and terms in this space is supposed to make semantic relations
more explicit. PLSA is an attempt to achieve this goal in a purely data-driven
fashion without recourse to general linguistic knowledge, i.e., based
exclusively on a document collection or corpus at hand. Provided these
expectations can be met, PLSA would offer great advantages in terms of
flexibility as well as in terms of domain adaptivity.

2.2 Probabilistic Latent Semantic Analysis 

PLSA is based on a latent class model which associates an unobserved class
variable $z \in \mathcal{Z} = \{z_1, \dots, z_K\}$ with each observation
$(d, w)$. As will be explained in more detail, the intention pursued by
introducing latent variables is to model text topics such that each possible
state $z \in \mathcal{Z}$ would ideally represent one particular topic or
subject. Formally, let us define the following multinomial distributions:
$P(d)$ is used to denote the probability that a word is observed in a
particular document^1; $P(w|z)$ denotes word distributions conditioned on the
latent class variable $z$, which represent different topic factors. Finally,
$P(z|d)$ is used to denote document-specific distributions over the latent
variable space $\mathcal{Z}$. We may now define the following probabilistic
model over $\mathcal{D} \times \mathcal{W}$:

$$P(d, w) = P(d)\, P(w|d), \quad \text{where} \quad P(w|d) = \sum_{z \in \mathcal{Z}} P(w|z)\, P(z|d). \tag{1}$$

This model is based on a crucial conditional independence assumption, namely
that $d$ and $w$ are independent conditioned on the state of the latent
variable $z$ associated with the observation $(d, w)$. As a result, the
conditional distributions $P(w|d)$ in (1) are represented as convex
combinations of the $K$ factors $P(w|z)$. Since in the typical case one has
$K \ll N$, the latent variable $z$ can be thought of as a bottleneck variable
in predicting words conditioned on documents.

^1 This is intended to account for varying document lengths.

To demonstrate how this corresponds to a mixture decomposition of the
term-document matrix, we switch to an alternative parameterization by applying
Bayes' rule to $P(z|d)$, arriving at

$$P(d, w) = \sum_{z \in \mathcal{Z}} P(z)\, P(d|z)\, P(w|z), \tag{2}$$

which is perfectly symmetric in both entities, documents and words. Based on
(2), let us formulate the probability model (1) in matrix notation by defining
$U = [P(d_i|z_k)]_{i,k}$, $V = [P(w_j|z_k)]_{j,k}$, and
$\Sigma = \mathrm{diag}[P(z_k)]_k$, so that
$P = [P(d_i, w_j)]_{i,j} = U \Sigma V^t$. The algebraic form of this
decomposition corresponds exactly to the decomposition of $\mathbf{N}$ obtained
by Singular Value Decomposition (SVD) in standard Latent Semantic Analysis
(LSA) [?]. However, the statistical model fitting principle used in
conjunction with PLSA is the likelihood principle, while LSA is based on the
Frobenius or $L_2$-norm of matrices. The statistical approach offers important
advantages since it explicitly aims at minimizing word perplexity^2. The
mixture approximation $P$ of the co-occurrence table is a well-defined
probability distribution, and factors have a clear probabilistic meaning in
terms of mixture component distributions. In contrast, LSA does not define a
properly normalized probability distribution, and the obtained approximation
may even contain negative entries. In addition, the probabilistic approach can
take advantage of the well-established statistical theory for model selection
and complexity control, e.g., to determine the optimal number of latent space
dimensions (cf. [?]). Last but not least, the statistical formulation can be
systematically extended and generalized in various ways, an example being the
model presented in Section 3 of this paper.
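The step from the asymmetric parameterization (1) to the symmetric form (2) can be spelled out as follows. This is a short derivation sketch that uses only Bayes' rule and the distributions defined above; it introduces no new symbols.

```latex
\begin{align*}
P(d, w) &= P(d) \sum_{z \in \mathcal{Z}} P(w|z)\, P(z|d) \\
        &= \sum_{z \in \mathcal{Z}} P(w|z)\, \bigl[ P(z|d)\, P(d) \bigr] \\
        &= \sum_{z \in \mathcal{Z}} P(w|z)\, P(d|z)\, P(z),
        \qquad \text{since } P(z|d)\, P(d) = P(d, z) = P(d|z)\, P(z).
\end{align*}
```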



2.3 EM Algorithm for PLSA 



In order to fit the model in (1), we follow the standard statistical procedure
and perform maximum likelihood estimation with the EM algorithm [?]. One has
to maximize

$$\mathcal{L} = \sum_{d \in \mathcal{D}} \sum_{w \in \mathcal{W}} n(d, w) \log P(d, w) \tag{3}$$

with respect to all multinomial distributions which define $P(d, w)$. EM is
guaranteed to find a local maximum of $\mathcal{L}$ by alternating two steps:
(i) an expectation (E) step, where posterior probabilities for the latent
variables are computed based on the current estimates of the parameters, and
(ii) a maximization (M) step, where parameters are updated based on the
posterior probabilities computed in the E-step. For the E-step one simply
applies Bayes' formula, e.g., in the parameterization of (1), to obtain

$$P(z|d, w) = \frac{P(z|d)\, P(w|z)}{\sum_{z' \in \mathcal{Z}} P(z'|d)\, P(w|z')}. \tag{4}$$

It is straightforward to derive the M-step equations [?]

$$P(w|z) \propto \sum_{d \in \mathcal{D}} n(d, w)\, P(z|d, w), \qquad P(z|d) \propto \sum_{w \in \mathcal{W}} n(d, w)\, P(z|d, w). \tag{5}$$

The estimation of $P(d) \propto \sum_{w} n(d, w)$ can be carried out
independently. Alternating (4) and (5), initialized from randomized starting
conditions, results in a procedure which converges to a local maximum of the
log-likelihood in (3).

Fig. 1. The 3 latent factors to most likely generate the word "segment",
derived from a $K = 128$ PLSA of the CLUSTER document collection. The displayed
terms are the most probable in the class-conditional distribution $P(w|z)$.
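The alternation of the E-step (4) and the M-step (5) described in Section 2.3 can be sketched in a few lines. The code below is a minimal, unoptimized illustration and not the implementation used for the experiments: the function names, the dense list-of-lists layout, the toy count matrix, and the uniform fallback in `normalize` are all assumptions of this sketch.

```python
import math
import random


def normalize(values):
    """Scale non-negative values to sum to one (uniform if all are zero)."""
    total = sum(values)
    if total == 0.0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsa_em(counts, num_topics, iterations=50, seed=0):
    """Fit P(w|z) and P(z|d) to counts[d][w] = n(d, w) by EM."""
    rng = random.Random(seed)
    num_docs, num_words = len(counts), len(counts[0])
    # Randomized starting conditions.
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(num_words)])
             for _ in range(num_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in range(num_docs)]
    for _ in range(iterations):
        new_w_z = [[0.0] * num_words for _ in range(num_topics)]
        new_z_d = [[0.0] * num_topics for _ in range(num_docs)]
        for d in range(num_docs):
            for w in range(num_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step, eq. (4): posterior over z for the pair (d, w).
                post = normalize([p_z_d[d][z] * p_w_z[z][w]
                                  for z in range(num_topics)])
                # Accumulate the sums of the M-step, eq. (5).
                for z in range(num_topics):
                    new_w_z[z][w] += n * post[z]
                    new_z_d[d][z] += n * post[z]
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d


def log_likelihood(counts, p_w_z, p_z_d):
    """Log-likelihood (3), with P(d) estimated from the document lengths."""
    total = sum(sum(row) for row in counts)
    value = 0.0
    for d, row in enumerate(counts):
        p_d = sum(row) / total
        for w, n in enumerate(row):
            if n:
                p_w_d = sum(p_z_d[d][z] * p_w_z[z][w]
                            for z in range(len(p_w_z)))
                value += n * math.log(p_d * p_w_d)
    return value


# Toy count matrix with two word blocks (illustrative data only).
toy_counts = [[4, 2, 0, 0], [3, 3, 0, 0], [0, 0, 5, 1], [0, 0, 2, 4]]
topics, mixtures = plsa_em(toy_counts, num_topics=2)
```

Because the random seed is fixed, a run with more iterations continues the trajectory of a shorter run, which makes it easy to check that the log-likelihood does not decrease.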



2.4 Example: Analysis of Word Usage with PLSA 

Let us briefly discuss an elucidating example application of PLSA at this
point. We have run PLSA with 128 factors on two datasets: (i) CLUSTER, a
collection of paper abstracts on clustering, and (ii) the TDT1 collection (cf.
Section 4 for details).

As a particularly interesting term in the CLUSTER domain we have chosen the
word "segment". Figure 1 shows the most probable words of the 3 out of the 128
factors which have the highest probability to generate the term "segment". This
sketchy characterization reveals very meaningful sub-domains: The first factor
deals with image processing, where "segment" refers to a region in an image.
The second factor describes speech recognition, where "segment" refers to a
phonetic unit of an acoustic signal such as a phoneme. The third factor deals
with video coding, where "segment" is used in the context of motion
segmentation in image sequences. The factors thus seem to capture relevant
topics in the domain under consideration.

Three factors from the decomposition of the TDT1 collection with a high
probability for the term "UN" are displayed in Figure 2. The vocabulary clearly
characterizes news stories related to certain incidents in the period of
1994/1995 covered by the TDT1 collection. The first factor deals with the war
in Bosnia, the second with UN sanctions against Iraq, and the third with the
Rwandan genocide. These examples show that the topics identified by PLSA might
also correspond to something one might more appropriately refer to as events.
Depending on the training collection and the specific domain, the notion of
topic thus has to be taken in a broader sense.

^2 Perplexity is a term from statistical language modeling which is utilized
here to refer to the (log-averaged) inverse predictive probability $1/P(w|d)$.

Fig. 2. Three factors to most likely generate the word "UN" from a 128 factor
decomposition of the TDT1 corpus.



2.5 PLSA: What Is Missing? 

From the example in Figure 2 one can see that the factors $P(w|z)$ extracted
by PLSA provide a fairly concise description of topics or events, which can
potentially be utilized for interactive retrieval and navigation. However,
there is one major drawback: assuming that for large text collections one would
like to perform PLSA with a latent space dimensionality of the order of several
hundreds or even thousands, it seems inappropriate to expect the user to
examine all factors in search of relevant documents and topics of interest. Of
course, one may ask the user to provide additional keywords to narrow the
search, but this is nothing more than an ad hoc remedy to the problem.

What is really missing in PLSA as presented so far is a relationship between
the different factors. Suppose for concreteness one had identified a relevant
topic represented by some $P(w|z)$; the identity of $z$ does not provide any
information about whether or not another topic $P(w|z')$ could be relevant as
well. The generalization we present in the following section extends the PLSA
model in a way that enables it to capture additional information about the
relationships between topics. In the case of a two-dimensional map, this
results in a spatial arrangement of topics on a two-dimensional grid, a format
which may support different types of visualization and navigation. Other
topologies can be obtained by exactly the same mechanism described in the
sequel.

3 Topological PLSA

In order to extend the PLSA model in the described way, we make use of a
principle that was originally proposed in the seminal work of Kohonen on
Self-Organizing Maps (SOM) [?]. While the formulation of the algorithm in [?]
was heuristic and mainly motivated in a biological setting, several authors
have subsequently proposed modifications which have stressed an
information-theoretic foundation of the SOM and pointed out the relations to
vector quantization for noisy communication channels (cf. [?]). Moreover, it
has been noticed [?] that the topology-preserving properties of the SOM are
independent of the vectorial representation that most research on the SOM has
been focusing on.



3.1 Topologies from Confusion Probabilities 



The key step in the proposed generalization is to introduce an additional
latent variable $v \in \mathcal{Z}$ of the same cardinality as $z$ to define
the probability model

$$P(d, w) = P(d)\, P(w|d), \qquad P(w|d) = \sum_{z} P(w|z) \sum_{v} P(z|v)\, P(v|d). \tag{6}$$

It is straightforward to verify that from a purely statistical point of view
this does not offer any additional modeling power. Whatever the choice for
$P(z|v)$ and $P(v|d)$ might be, one can simply define
$P(z|d) = \sum_{v} P(z|v)\, P(v|d)$ to obtain exactly the same distribution
over $\mathcal{D} \times \mathcal{W}$ in the more parsimonious model of (1).
Yet, we do not propose to fit all model parameters in (6) from training data,
but to fix the confusion probabilities^3 $P(z|v)$ to prespecified values
derived from a neighborhood function in the latent variable space
$\mathcal{Z}$. We will focus on means to enforce a topological organization of
the topic representations $P(w|z)$ on a two-dimensional grid with boundaries.
Let us introduce the notation $z(x, y)$, $1 \le x, y \le L$,
$x, y \in \mathbb{N}$, to identify latent states $z(x, y) \in \mathcal{Z}$ with
points $(x, y)$ on the grid. By the Euclidean metric, this embedding induces a
distance function on $\mathcal{Z}$, namely

$$d(z(x, y), z(x', y')) = d((x, y), (x', y')) = \sqrt{(x - x')^2 + (y - y')^2}. \tag{7}$$

^3 We use this terminology because the relationship between $z$ and $v$ can be
thought of in terms of a communication scenario: $v$ represents the original
message and $z$ the message received after sending it via a noisy channel.
$P(z|v)$ then corresponds to the channel characteristic, i.e., how probable it
is to receive $z$ after sending $v$.

Thomas Hofmann

Now we propose to define $P(z|v)$ via a Gaussian with standard deviation
$\sigma$,

$$P(z|v) = \frac{\exp[-d(z, v)^2 / (2\sigma^2)]}{\sum_{z'} \exp[-d(z', v)^2 / (2\sigma^2)]}, \tag{8}$$

where $\sigma$ is assumed to be fixed for now. To understand why this favors a
topological organization of topics, consider a document $d$ with its topic
distribution $P(v|d)$. The confusion probabilities tilt this distribution to a
distribution $P(z|d) = \sum_{v} P(z|v)\, P(v|d)$. For simplicity assume that
$P(v|d) = 1$ for a particular $v \in \mathcal{Z}$; then the confusion
probabilities will blend in additional contributions mainly from neighboring
states $z$ of $v$ on the two-dimensional grid. If these neighboring states
represent very different topics, the resulting word distribution $P(w|d)$ in
(6) will significantly deviate from the distribution one would get from (1),
which, assuming that $P(v|d)$ was chosen optimally, will result in a poor
estimate. If on the other hand the neighbors of $v$ represent closely related
topics, this deviation will in general be much less severe. A meaningful
topological arrangement of topics will thus pay off in terms of word
perplexity.
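The construction of the confusion probabilities from the grid distance (7) and the Gaussian neighborhood (8) can be sketched as follows. The row-major enumeration of grid states, the function names, and the example values `side=4` and `sigma=1.0` are assumptions of this sketch rather than choices documented in the paper.

```python
import math


def grid_states(side):
    """Enumerate the latent states z(x, y) with 1 <= x, y <= side."""
    return [(x, y) for x in range(1, side + 1) for y in range(1, side + 1)]


def confusion_matrix(side, sigma):
    """Return conf with conf[v][z] = P(z|v) as defined in eq. (8)."""
    states = grid_states(side)
    conf = []
    for (vx, vy) in states:
        # Squared Euclidean grid distance, eq. (7), inside a Gaussian kernel.
        weights = [
            math.exp(-((zx - vx) ** 2 + (zy - vy) ** 2) / (2.0 * sigma ** 2))
            for (zx, zy) in states
        ]
        total = sum(weights)
        conf.append([weight / total for weight in weights])
    return conf


def tilt(p_v_d, conf):
    """Compute P(z|d) = sum_v P(z|v) P(v|d) for one document."""
    size = len(conf)
    return [sum(conf[v][z] * p_v_d[v] for v in range(size))
            for z in range(size)]


conf = confusion_matrix(side=4, sigma=1.0)
# A document concentrated on the corner state (1, 1), i.e. P(v|d) = 1 there.
delta = [1.0] + [0.0] * 15
p_z_d = tilt(delta, conf)
```

For such a concentrated document the tilted distribution is simply the corresponding row of the confusion matrix, with most of the blended-in mass on the grid neighbors of the chosen state.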



3.2 EM Algorithm for Topological PLSA 



The next step consists in deriving the EM equations for topological PLSA.
Standard calculations yield the M-step re-estimation formulae

$$P(w|z) \propto \sum_{d} n(d, w)\, P(z|d, w), \quad \text{and} \quad P(v|d) \propto \sum_{w} n(d, w)\, P(v|d, w). \tag{9}$$

For the evaluation of (9) the marginal posterior probabilities are sufficient,
and it is not necessary to compute the joint posterior $P(v, z|d, w)$. The
marginal posterior probabilities are given by

$$P(v|d, w) = \sum_{z} P(v, z|d, w) = \frac{P(v|d)\, P(w|v)}{\sum_{v'} P(v'|d)\, P(w|v')}, \quad \text{and} \tag{10}$$

$$P(z|d, w) = \sum_{v} P(v, z|d, w) = \frac{P(z|d)\, P(w|z)}{\sum_{z'} P(z'|d)\, P(w|z')}, \tag{11}$$

where $P(w|v) = \sum_{z} P(w|z)\, P(z|v)$ and
$P(z|d) = \sum_{v} P(z|v)\, P(v|d)$. Notice also that the marginal posteriors
are simply related by

$$P(v|d, w) = \sum_{z} P(v|z)\, P(z|d, w), \qquad P(z|d, w) = \sum_{v} P(z|v)\, P(v|d, w). \tag{12}$$

In summary, one observes that the EM algorithm for topological PLSA requires
the computation of marginal posteriors and document/word conditionals for both
variables $v$ and $z$. Moreover, these quantities are related by a simple
matrix multiplication with the confusion matrix $[P(z_k|v_l)]_{k,l}$ or its
counterpart $[P(v_k|z_l)]_{k,l}$.
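One iteration of this scheme can be sketched as below, using the M-step (9) together with the marginal posteriors (10) and (11) and a fixed confusion matrix. This is an illustrative sketch under stated assumptions, not the author's code: the data layout (`conf[v][z]` holding $P(z|v)$), the two-state confusion matrix, the starting values, and the toy counts are all hypothetical.

```python
import math


def normalize(values):
    """Scale non-negative values to sum to one (uniform if all are zero)."""
    total = sum(values)
    if total == 0.0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def word_given_doc(p_w_z, p_v_d_row, conf, w):
    """P(w|d) from eq. (6) for one document and one word."""
    size = len(conf)
    return sum(
        p_w_z[z][w] * sum(conf[v][z] * p_v_d_row[v] for v in range(size))
        for z in range(size)
    )


def log_likelihood(counts, p_w_z, p_v_d, conf):
    """Sum of n(d, w) log P(w|d); the P(d) term is constant and omitted."""
    return sum(
        n * math.log(word_given_doc(p_w_z, p_v_d[d], conf, w))
        for d, row in enumerate(counts)
        for w, n in enumerate(row)
        if n
    )


def topological_em_step(counts, p_w_z, p_v_d, conf):
    """One EM iteration; conf[v][z] = P(z|v) is kept fixed."""
    size, num_docs, num_words = len(conf), len(counts), len(counts[0])
    # P(w|v) = sum_z P(w|z) P(z|v)
    p_w_v = [[sum(p_w_z[z][w] * conf[v][z] for z in range(size))
              for w in range(num_words)] for v in range(size)]
    new_w_z = [[0.0] * num_words for _ in range(size)]
    new_v_d = [[0.0] * size for _ in range(num_docs)]
    for d in range(num_docs):
        # P(z|d) = sum_v P(z|v) P(v|d)
        p_z_d = [sum(conf[v][z] * p_v_d[d][v] for v in range(size))
                 for z in range(size)]
        for w in range(num_words):
            n = counts[d][w]
            if n == 0:
                continue
            post_v = normalize([p_v_d[d][v] * p_w_v[v][w]
                                for v in range(size)])       # eq. (10)
            post_z = normalize([p_z_d[z] * p_w_z[z][w]
                                for z in range(size)])       # eq. (11)
            for k in range(size):                            # eq. (9)
                new_w_z[k][w] += n * post_z[k]
                new_v_d[d][k] += n * post_v[k]
    return ([normalize(row) for row in new_w_z],
            [normalize(row) for row in new_v_d])


# Hypothetical two-state example.
counts = [[4, 2, 0, 0], [3, 3, 0, 0], [0, 0, 5, 1], [0, 0, 2, 4]]
conf = [[0.9, 0.1], [0.1, 0.9]]
p_w_z = [[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]]
p_v_d = [[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.4, 0.6]]
history = [log_likelihood(counts, p_w_z, p_v_d, conf)]
for _ in range(10):
    p_w_z, p_v_d = topological_em_step(counts, p_w_z, p_v_d, conf)
    history.append(log_likelihood(counts, p_w_z, p_v_d, conf))
```

The sketch computes both marginal posteriors directly from (10) and (11) rather than through the relation (12), so it does not depend on the reverse confusion probabilities $P(v|z)$.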



3.3 Topologies and Hierarchies 

There are two ways in which hierarchies are of interest in the context of topo- 
logical PLSA: (i) To accelerate the PLSA by a multi-resolution optimization 
over a sequence of coarsened grids, (ii) To improve the visualization by offering 
multiple levels of abstraction or resolution on which the data can be visualized. 

Fig. 3. Multi-resolution visualization of the CLUSTER collection with grid maps
at 2 x 2, 4 x 4, 8 x 8 (upper left corner), and 16 x 16 (upper left corner).
Subfigure (3) shows the 4 x 4 subgrid obtained by zooming the marked 2 x 2
window in subfigure (2). Similarly, subfigure (4) is a zoomed-in version of the
marked window in subfigure (3).

A significant computational improvement can be achieved by performing PLSA on
a coarse grid, say starting on a 2 x 2 grid, and then recursively prolongating
the found solution according to a quadtree-like scheme. This involves copying
the distributions $P(w|z)$, with a small random disturbance, to the successors
of $z$ on the finer grid and distributing $P(v|d)$ from the coarse level among
its four successor states on the finer grid. This procedure has the additional
advantage that it often leads to better topological arrangements, since it is
less sensitive to 'topological defects'^4. The multi-resolution optimization is
coupled with a schedule for $\sigma$, which defines the length-scale for the
confusion probabilities in (8). In our experiments we have utilized a schedule
$\sigma_n = (1/\sqrt{2})^{2n/m} \sigma_0$, where $m$ corresponds to the number
of iterations performed at a particular resolution level, i.e., after $m$
iterations we have $\sigma_{n+m} = (1/2)\, \sigma_n$. Prolongation to a finer
grid is performed at iterations $n = m, 2m, 3m, \dots$.

Notice that the topological organization of topics has the further advantage
of supporting a simple coarsening procedure for visualization at different
resolution levels. The fact that neighboring latent states represent similar
topics suggests merging states, e.g., four at a time, to generate a coarser map
with word distributions $P(w|z)$ obtained by averaging over the associated
distributions on the finer grid with the appropriate weights $P(z)$. One can
thus dynamically navigate in a three-dimensional information space: vertically
between topic maps of different resolution and horizontally inside a particular
two-dimensional topic map.
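The two hierarchy operations of Section 3.3, prolongation to a finer grid and coarsening for display, can be sketched as follows. Several details are assumptions of this sketch because the text does not fix them: the 0-based cell coordinates, the multiplicative form and size of the random disturbance, and the equal split of $P(v|d)$ among the four successors. All function names are hypothetical.

```python
import random


def normalize(values):
    """Scale positive values to sum to one."""
    total = sum(values)
    return [v / total for v in values]


def successors(x, y):
    """The four fine-grid cells refining coarse cell (x, y), 0-based."""
    return [(2 * x + dx, 2 * y + dy) for dx in (0, 1) for dy in (0, 1)]


def prolongate(p_w_z, p_v_d, noise=0.01, seed=0):
    """Move a solution to the next finer grid.

    p_w_z maps a cell to its word distribution P(w|z); p_v_d is a list with
    one {cell: P(v|d)} dictionary per document.
    """
    rng = random.Random(seed)
    fine_w_z = {}
    for (x, y), dist in p_w_z.items():
        for cell in successors(x, y):
            # Copy P(w|z) with a small random disturbance (assumed form).
            fine_w_z[cell] = normalize(
                [p * (1.0 + noise * rng.random()) for p in dist]
            )
    # Assumption: each state's mass is split equally among its successors.
    fine_v_d = [
        {cell: p / 4.0 for (x, y), p in doc.items() for cell in successors(x, y)}
        for doc in p_v_d
    ]
    return fine_w_z, fine_v_d


def coarsen(p_w_z, p_z):
    """Merge each 2 x 2 block of cells, averaging P(w|z) with weights P(z)."""
    sums, weights = {}, {}
    for (x, y), dist in p_w_z.items():
        parent = (x // 2, y // 2)
        weight = p_z[(x, y)]
        acc = sums.setdefault(parent, [0.0] * len(dist))
        for i, p in enumerate(dist):
            acc[i] += weight * p
        weights[parent] = weights.get(parent, 0.0) + weight
    merged = {parent: [a / weights[parent] for a in acc]
              for parent, acc in sums.items()}
    return merged, weights


# One coarse cell with a three-word distribution, one document.
fine_w_z, fine_v_d = prolongate({(0, 0): [0.5, 0.3, 0.2]}, [{(0, 0): 1.0}])
coarse_w_z, coarse_p_z = coarsen(fine_w_z, {cell: 0.25 for cell in fine_w_z})
```

Coarsening the prolongated solution recovers a single cell whose word distribution is close to the original one, up to the injected disturbance.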



4 Experimental Results 

We have utilized two document collections for our experiments: (i) the TDT1
collection (Topic Detection and Tracking, distributed by the Linguistic Data
Consortium [?]) with 49,225 transcribed broadcast news stories, and (ii) a
collection of 1,568 abstracts of research papers on 'clustering' (CLUSTER).
All texts have been preprocessed with a stop word list; in addition, very
infrequent words with less than 3 occurrences have been eliminated. For the
TDT1 collection, word frequencies have been weighted with an entropic term
weight [?]. The 5 most probable words in the factors $P(w|z)$ have been
utilized for visualization and are displayed at the position corresponding to
the topic on the two-dimensional grid to produce topic maps. In an interactive
setting one would of course vary the number of displayed terms according to the
user's preferences.

A pyramidal visualization of the CLUSTER collection based on a 256 factor,
16 x 16 topological PLSA is depicted in Figure 3. One can see that meaningful
coarsened maps can be obtained from the 16 x 16 map; different areas like
astronomy, physics, databases, and pattern recognition can be easily
identified. In particular on the finer levels, the topological organization is
very helpful: the relation of different subtopics in signal processing,
including image processing and speech recognition, is well preserved by the
topic map. A similar map hierarchy for the TDT1 collection is depicted in
Figure 4. Different topics and events can effortlessly be identified from the
word distributions. Again, subtopics like the ones dealing with different
events of international politics are mapped to neighboring positions on the
lattice.



^4 There is a large body of literature dealing with the topology-preserving
properties of SOMs. The reader is referred to [?] and the references therein.

Fig. 4. Multi-resolution visualization of the TDT1 collection with grid maps at
2 x 2, 4 x 4, 8 x 8 (upper left corner), and 16 x 16 (upper left corner).
Subfigure (3) shows the 4 x 4 subgrid obtained by zooming the marked 2 x 2
window in subfigure (2). Similarly, subfigure (4) is a zoomed-in version of the
marked window in subfigure (3).



5 Conclusion 

We have presented a novel probabilistic technique for visualizing text
databases by topic maps. The main advantages are (i) a sound statistical
foundation on a latent class model with EM as a fitting procedure, (ii) the
principled combination of probabilistic modeling and topology preservation,
and (iii) the natural definition of resolution hierarchies. The benefits of
this approach to support interactive retrieval have been demonstrated briefly
with simple two-dimensional maps; however, since arbitrary topologies can be
extracted, one might expect
even more benefits in combination with more elaborate interfaces.

Abstract. The grand tour is a method for viewing multidimensional data via
linear projections onto a sequence of two-dimensional subspaces and then moving
continuously from one projection to the next. This paper extends the method to
a 3D grand tour where projections are made onto three-dimensional subspaces. A
3D cluster-guided tour is proposed where sequences of projections are
determined by cluster centroids. The cluster-guided tour makes inter-cluster
distance-preserving projections under which clusters are displayed as separate
as possible. Various add-on features, such as projecting variable vectors
together with data points, interactive picking and drill-down, and cluster
similarity graphs, help further the understanding of data. A CAVE virtual
reality environment is at our disposal for 3D immersive display. This approach
of multidimensional visualization provides a natural metaphor to visualize
clustering results and data at hand by mapping the data onto a time-indexed
family of 3D natural projections suitable for exploration by the human eye.



1 Introduction 

Visualization techniques have proven to be of high value in exploratory data
analysis and data mining. For data with a few dimensions, the scatterplot is an
excellent means of visualization. Patterns can be efficiently unveiled by simply
drawing each data point as a geometric object in the space determined by one,
two or three numeric variables of the data, while its size, shape, color and
texture are determined by other variables of the data. The ability to draw
scatterplots is a common feature of many visualization systems. Conventional
scatterplots lose their effectiveness, however, as the dimensionality of the
data becomes large.

A natural idea, then, is to project higher dimensional data orthogonally
onto lower dimensional subspaces. This allows us to look at multidimensional
data in a geometry that is within the perceptibility of human eyes. Since there
is an infinite number of ways to project high dimensional data onto lower
dimensions, and information is inevitably lost in the projection, the grand
tour [?] and other projection pursuit techniques [?] aim at automatically
finding the interesting projections or at least helping the users to find them.

Grand tour is an extension of data rotation for multidimensional data sets.
It is based on selecting a sequence of linear projections and moving
continuously from one projection to the next. By displaying a number of
intermediate projections obtained by interpolation, the entire process creates
an illusion of continuous, smooth motion through multidimensional displays. This
helps to find interesting projections, which are hard to find in the original
data owing to the curse of dimensionality. Furthermore, grand tour allows
viewers to easily keep track of a specific group of data points throughout a
tour. By examining where the data points go from one projection to the next,
viewers gain a much better understanding of the data than with conventional
visualization techniques such as bar charts or pie charts.

Now the question becomes how to choose “meaningful” projections and
projection sequences to maximize the chance of finding interesting patterns. One
simple way is to choose the span of any three variables as a 3D subspace and
then move from this span to the span of another three variables. This is what
we call “simple projection”. Each projection in the sequence is a 3D
scatterplot of three variables. It is more than a set of 3D scatterplots,
however, because more information can be unveiled by the animation moving from
one projection to the next. Another straightforward way is the random tour. By
choosing a 3D subspace at random and moving to the next randomly chosen 3D
subspace, the random tour provides a way of globally and dynamically browsing
multidimensional data. In the data preprocessing stage of a data mining
project, simple projection and random tour are efficient ways to examine the
distribution of values of each variable and the correlations among variables,
and to decide which variables should be included in further analysis. Although
real-world databases often have many variables, these variables are often
highly correlated, so the data are mercifully low-dimensional in effect. Simple
projection and random tour are useful for identifying the appropriate subspaces
in which further mining is meaningful.

There are various ways of choosing interesting projections and projection
sequences in a tour. For clustered data sets, one promising way is to use the
positions of data clusters to help choose projections. Let us assume that a
data set is available as data points in the p-dimensional Euclidean space and
has been clustered into k clusters. Each cluster has a centroid, which is simply
the average of all the data points contained in the cluster. Any four distinct
points that do not lie in a common plane uniquely determine a 3D subspace. If we
choose the centroids of any four clusters and project all data points onto the
3D subspace determined by these four cluster centroids, the Euclidean distance
between any two of the four cluster centroids will be preserved and the four
clusters will be displayed as separately as possible from each other. We call
this a cluster-guided projection. Observe that there are up to $\binom{k}{4}$
possible cluster-guided projections. By using the grand tour to move from one
cluster-guided projection to another, a viewer can quickly get a good sense of
the positions of all data clusters.

There are both linear and nonlinear techniques for dimension reduction
of high dimensional data. Rather than nonlinear techniques such as Sammon’s
projection [?], which aims at preserving all inter-cluster distances by
minimizing a cost function, we found linear projections more intuitive for the
purpose of unveiling cluster structure and more suitable for exploration by the
human eye. Linear projections and scatterplots can be found in many
visualization systems (for example, the earlier Biplot [?]). The idea of using
a grand tour of lower dimensional projections to simulate higher dimensional
displays was first proposed in [?]. Techniques were developed to design the
path of a tour, for example, to principal component and canonical variate
subspaces [?], or to hill-climbing paths that follow gradients of projection
pursuit indices [?]. An example visualization system which implements 2D
projections and grand tour is XGobi [?]. For the visualization of data
clusters, a 2D cluster-guided tour was proposed in [?].

To exploit the 3D nature of human visual perception, we developed a
visualization system for 3D projection and cluster-guided tour. A CAVE immersive
virtual environment [?] is at our disposal for 3D immersive display. With the
CAVE as a 3D “magic canvas”, scatterplots can be drawn in mid-air in the 3D
virtual space. This greatly helps data analysts visualize data and mining
results. It helps to show 3D distributions of data points, to locate similarity
or dissimilarity between various clusters, and furthermore to determine which
clusters to merge or to split further. Compared with the other systems mentioned
above, the grand tour in the CAVE virtual environment has the following
characteristics: (1) 3D projection; (2) immersive virtual reality display; (3)
cluster-guided projection determined by 4 data clusters; and (4) very intuitive
add-on tools for interaction and drill-down. It represents a novel tool for
visualizing multidimensional data and is now routinely employed for
preprocessing data and analyzing mining results. It is also used to visually
communicate mining results to clients.

The paper is organized as follows: Section 2 introduces the grand tour.
Section 3 discusses in detail the 3D cluster-guided projections and
cluster-guided tour. Section 4 covers projection rendering inside the CAVE
virtual environment. Section 5 presents add-on features such as projecting
variable vectors together with data points, interactive picking and drill-down,
and cluster similarity graphs. Section 6 concludes the paper with future work
and directions.

2 Grand Tour 

For ease of illustration, suppose we are to make a 2D tour in 3D Euclidean space
(Fig. 1). A 2D oriented projection plane, or 2-frame (a 2-frame is an
orthonormal pair of vectors), can be identified by a unit index vector that is
perpendicular to the plane. The most direct way to move from one 2D projection
to the next is through a sequence of interpolated projections that move the
index vector to the next index vector along a geodesic path on the unit sphere.
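The geodesic motion of the index vector can be illustrated with spherical linear interpolation. The following is a minimal sketch, not the authors' implementation: the function names `slerp` and `index_vector_path` are hypothetical, and the two index vectors are assumed to be unit vectors that are not antipodal (for antipodal vectors the geodesic is not unique).

```python
import math

def slerp(u, v, t):
    """Point at fraction t along the great-circle arc from unit vector u to v.

    Assumes u and v are unit vectors and not antipodal.
    """
    cos_angle = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(cos_angle)        # angle between the two index vectors
    if omega < 1e-12:                   # same direction: nothing to interpolate
        return list(u)
    s = math.sin(omega)
    w0 = math.sin((1.0 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(u, v)]

def index_vector_path(u, v, steps):
    """Evenly spaced index vectors from u to v, endpoints included."""
    return [slerp(u, v, i / steps) for i in range(steps + 1)]

# Move the index vector from the z-axis to the x-axis in four steps.
path = index_vector_path([0.0, 0.0, 1.0], [1.0, 0.0, 0.0], 4)
```

Each vector in `path` identifies one intermediate projection plane (the plane perpendicular to it), so rendering the projections in order gives the smooth motion described above.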

For a 3D grand tour of p-dimensional (p > 3) data sets, in the same way, it
is necessary to have an explicitly computable sequence of interpolated 3-frames
in p-dimensional Euclidean space. The p-dimensional data is then projected, in
turn, onto the 3D subspace spanned by each 3-frame. For the shortest path
from one 3D projection to another, the sequence of interpolated 3-frames
should be as straight as possible. Here “straight” means: if we think of the
interpolated 3D subspaces as evenly spaced points on a curve in the space of 3D
subspaces through the origin in Euclidean p-space (a so-called “Grassmannian
manifold”) (Fig. 2), we should be able to choose that curve so that it is
almost a geodesic.

Moving along a geodesic path creates a sequence of intermediate projections
moving smoothly from the current to the target projection. This ensures that
the sequence of projections is comprehensible and that it moves rapidly to the
target projection. For 3D projections, a geodesic path is simply a rotation in
the (at most) 6-dimensional subspace containing both the current and the target
3D spaces. This implies that some pre-projection is necessary in implementation
so that data projections are computed within the joint span of the current and
the next 3D subspaces, the dimension of which can be substantially smaller than
p. Various smoothness properties of such geodesic paths are explored in great
detail in [3]. For a description of implementation details, see
[?, Subsection 2.2.1].
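The pre-projection step can be sketched as follows. This is an illustrative assumption about how it might be done, not the implementation from the cited reference: the helper names `orthonormalize` and `preproject` are hypothetical, and plain Gram-Schmidt is used to build a basis of the joint span of the two 3-frames.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormalize(vectors, tol=1e-10):
    """Gram-Schmidt; skips vectors that depend (numerically) on earlier ones."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def preproject(points, current_frame, target_frame):
    """Coordinates of each point in an orthonormal basis of the joint span."""
    joint = orthonormalize(list(current_frame) + list(target_frame))
    reduced = [[dot(x, b) for b in joint] for x in points]
    return joint, reduced

# Toy example in p = 8 dimensions; the two frames share one axis,
# so the joint span has dimension 5 rather than 6.
p = 8
unit = lambda i: [1.0 if j == i else 0.0 for j in range(p)]
current = [unit(0), unit(1), unit(2)]
target = [unit(2), unit(3), unit(4)]
points = [[float(i + j) for j in range(p)] for i in range(3)]
joint, reduced = preproject(points, current, target)
```

Because every intermediate 3-frame on the geodesic lies inside the joint span, the interpolated projections can be computed from `reduced` (at most 6 coordinates per point) instead of the original p coordinates.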

3 3D Cluster-Guided Projection and Cluster-Guided 
Tour 

Let $\{X_i\}_{i=1}^{n}$ denote a data set, that is, a set of n data points each
taking values in the p-dimensional Euclidean space $\mathbb{R}^p$, $p > 3$. Let
$X \cdot Y$ denote the dot product of two points X and Y. Write the Euclidean
norm of X as $\|X\| = \sqrt{X \cdot X}$, and the Euclidean distance between X
and Y as $d(X, Y) = \|X - Y\|$. Let us suppose that we have partitioned the
data set into k clusters, $k \ge 4$, and let $\{C_j\}_{j=1}^{k}$ denote the
cluster centroids.

Fig. 2. A path of intermediate planes that interpolates a sequence of projection
planes.

Any four distinct cluster centroids $C_a$, $C_b$, $C_c$ and $C_d$ in
$\{C_j\}_{j=1}^{k}$ that do not lie in a common plane determine a unique 3D
subspace in $\mathbb{R}^p$. Let $K_1$, $K_2$ and $K_3$ constitute an
orthonormal basis of the subspace (this can be obtained by orthonormalizing
$C_b - C_a$, $C_c - C_a$, and $C_d - C_a$). We can then compute a 3D projection
by projecting the data set $\{X_i\}_{i=1}^{n}$ onto the 3-frame
$(K_1, K_2, K_3)$. This projection preserves the inter-cluster distances, that
is, the Euclidean distance between any two of the four cluster centroids
$\{C_a, C_b, C_c, C_d\}$ is preserved after the projection. Specifically, let
$X|_P = (X \cdot K_1, X \cdot K_2, X \cdot K_3)$ denote the 3D projection of a
p-dimensional point X; then $d(X|_P, Y|_P) = d(X, Y)$ for any
$X, Y \in \{C_a, C_b, C_c, C_d\}$. This inter-cluster-distance-preserving
projection gives a viewpoint from which these four clusters are seen as far
apart as possible (Fig. 4).
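The construction of the 3-frame and the distance-preservation property can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the four centroids are made-up points in a 5-dimensional space, and Gram-Schmidt is used as one possible orthonormalization.

```python
import math
from itertools import combinations

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dist(a, b):
    d = sub(a, b)
    return math.sqrt(dot(d, d))

def cluster_guided_frame(ca, cb, cc, cd, tol=1e-10):
    """Orthonormal 3-frame from Gram-Schmidt on Cb - Ca, Cc - Ca, Cd - Ca."""
    frame = []
    for v in (sub(cb, ca), sub(cc, ca), sub(cd, ca)):
        w = v
        for k in frame:
            c = dot(w, k)
            w = [wi - c * ki for wi, ki in zip(w, k)]
        norm = math.sqrt(dot(w, w))
        if norm <= tol:
            raise ValueError("centroids lie in a common plane")
        frame.append([wi / norm for wi in w])
    return frame

def project(x, frame):
    """3D projection (X.K1, X.K2, X.K3) of a p-dimensional point."""
    return [dot(x, k) for k in frame]

# Hypothetical centroids in p = 5 dimensions.
centroids = [
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 2.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 3.0, 0.0, 0.0],
]
frame = cluster_guided_frame(*centroids)
errors = [
    abs(dist(project(x, frame), project(y, frame)) - dist(x, y))
    for x, y in combinations(centroids, 2)
]
```

All six entries of `errors` are zero up to rounding, because every difference between two of the four centroids lies in the span of the frame. For other data points the projection can only shorten distances, never lengthen them.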

There are various ways to choose the path (sequence of projections) of a tour.
One way is to simply choose a tripod from the p variable unit vectors as the
axes of one 3D projection and move from this projection to the next, whose axes
are another tripod. This is what we call “simple projection” (Fig. 3). It
gives a way to continuously check a sequence of scatterplots of the data
against any three variables. Another straightforward way is the random tour,
where each projection in the sequence is randomly generated. This gives a way
of globally and dynamically browsing multidimensional data.

Fig. 3. Simple projection: a 3D scatterplot.

Cluster-guided tour is a way to get cluster centroids involved in choosing
projection sequences. Given k cluster centroids, there are at most
$\binom{k}{4}$ combinations of unique 3D cluster projections. Each projection
allows us to visualize the multidimensional data in relation to four cluster
centroids. To visualize the multidimensional data in relation to all cluster
centroids, we display a sequence of cluster-guided projections and use grand
tour to move continuously from one projection in the sequence to the next.
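The set of candidate target projections can be enumerated directly. This is a small sketch: the function name `cluster_guided_targets` is hypothetical, clusters are identified by index, and the lexicographic ordering is an assumption (the text does not prescribe an order in which targets are visited).

```python
from itertools import combinations
from math import comb

def cluster_guided_targets(k):
    """All 4-subsets of cluster indices 0..k-1; each defines one candidate projection."""
    return list(combinations(range(k), 4))

targets = cluster_guided_targets(6)
count = len(targets)   # equals comb(6, 4)
```

The count is an upper bound ("at most") because a 4-subset whose centroids happen to lie in a common plane does not define a 3D subspace.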

The basic idea behind cluster-guided tour is simple: choose a target
projection from the $\binom{k}{4}$ possible cluster-guided projections, move
smoothly from the current projection to the target projection, and continue. We
illustrate the 3D cluster-guided projection and guided tour on the Boston
housing data set from the UCI ML Repository. This data set has n = 506 data
points and p = 13 real-valued attributes. The data set is typical (not in size,
but in spirit) of the data sets routinely encountered in market segmentation.
The 13 attributes measure various characteristics such as the crime rate, the
proportion of old units, property tax rate, pupil-teacher ratio in schools,
etc., that affect housing prices. We normalized all 13 attributes to take
values in the interval [0, 1]. To enable the cluster-guided tour, any
clustering algorithm could be used to cluster the data set. Here we clustered
the data set into 6 clusters by Kohonen’s Self-Organizing Map [?]. The six
resulting clusters have 114, 46, 29, 107, 78, and 132 data points respectively.
There are $\binom{6}{4} = 15$ possible 3D cluster-guided projections. We plot
one of them in Fig. 4. To underscore the value of 3D cluster-guided projections
in locating interesting projections, compare Fig. 4 to Fig. 3, where we display
a scatterplot of one of the attributes, “industrial (proportion of non-retail
business acres)”, against two of the other attributes, “minority” and “ages of
units.” Unlike the scatterplot, the 3D cluster-guided projections reveal
significant information about the positions of the clusters.

Fig. 4. A 3D cluster-guided projection determined by centroids (big balls with
labels) of Clusters 1, 2, 4, 5. The four clusters are visualized as separately
as possible. A p-pod of variable vectors is shown. Each ray of the p-pod
represents the projection of a variable axis whose length represents the
maximum value of the variable.



4 Rendering inside the CAVE Virtual Environment 

CAVE is a projection-based virtual reality environment which uses 3D computer
graphics and position tracking to immerse users inside a 3D space. The CAVE
at IHPC has a room-like physical space of 10 × 10 × 10 feet. Stereographic
images are rear-projected onto three side walls and front-projected onto the
floor. The four projected images are driven by 2 InfiniteReality graphics
pipelines inside an SGI Onyx2 computer. The illusion of 3D is created through
the use of LCD shutter glasses, which are synchronized to the computer display
through infrared emitters and alternate between the left-eye and right-eye
viewpoints. Multiple viewers can enter the CAVE and share the same virtual
experience, but only one viewer can have the position and orientation of his or
her head and hand captured.

With the CAVE as a 3D “magic canvas”, a 3D projection of high dimensional
data is rendered as a galaxy in mid-air in the virtual space (Fig. 5). The
projection can be reshaped, moved back and forth, and rotated by using a wand
(a 3D mouse). Each data point is painted as a sphere whose color represents
the cluster it belongs to. Spheres can be resized, and the speed of motion can
be manually controlled at any time during a tour by adjusting an X-Y sensor
attached to the wand. For easy identification, cluster centroids are painted as
big cubes and labeled with cluster names. The variable vectors, which show the
contribution of each variable to the projection, are visualized as white lines
from the origin and marked with the names of the variables at their far ends.
There are two different ways of interactive picking: brushing with a resizable
sphere brush, and cluster-picking by selecting a cluster’s centroid. The CAVE
has plenty of space for data rendering. In the future, we may have multiple
viewing projections synchronized and displayed simultaneously.

5 Add-On Features 

5.1 Where Are We in a Tour?

A dizzy feeling besets many first-time viewers of high-dimensional data
projections, and they may ask “How do I know what I am looking at?” In
geometric terms, the task is to locate the position of a projection 3-frame in
p-space. A visual way of conveying this information is to project the variable
unit vectors in p-space like regular data, and render the result together with
the data points.

Examples of the application are shown in the figures (see, e.g., Fig. 4). A
generalized tripod called a “p-pod” is an enhanced rendition of the p variable
unit vectors in p-space. Variable vectors in the p-pod can be treated as if
they were real data, rendered as lines, and labeled with variable names at
their far ends so that they are recognized as guide posts rather than data. In
the figures, we choose the maximum value rather than the unit value of a
variable as the length of its variable vector. The p-pod looks like a star with
p unequal rays in 3D space, each indicating the contribution of a variable to
the current projection.
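The p-pod endpoints follow directly from the projection formula. The sketch below is illustrative only: the function name `p_pod`, the toy 3-frame in 4-dimensional space, and the maximum values are all assumptions made for the example.

```python
import math

def p_pod(frame, max_values):
    """Endpoint of each ray: the 3D projection of max_j * e_j onto the 3-frame.

    Since e_j . K_i is just the j-th coordinate of K_i, no dot product is needed.
    """
    return [[max_values[j] * k[j] for k in frame] for j in range(len(max_values))]

# Hypothetical 3-frame in p = 4 dimensions (rows are K1, K2, K3).
r = 1.0 / math.sqrt(2.0)
frame = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, r, r],
]
max_values = [1.0, 1.0, 1.0, 0.5]   # assumed maximum of each variable
rays = p_pod(frame, max_values)
```

Each entry of `rays` is drawn as a line from the origin to that endpoint. A long ray means the variable contributes strongly to the current view, while a ray near the origin means the variable is nearly orthogonal to it.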

5.2 Interactive Picking and Drill-Down 

An advantage of grand tour is that a viewer can easily keep track of the
movement of a certain group of data points during the whole journey of a tour. A
cluster, or a set of data points, can be picked by pointing to the cluster
centroid or using a brushing tool. The data points picked so far can be related
back to the data, which makes further analysis possible, such as launching
another mining process for drill-down.
Fig. 5. This is a cluster-guided projection in the CAVE. The wireframe box
indicates a 3D room where data points are plotted in mid-air. The cluster-guided 
projection is determined by centroids of Clusters 1, 2, 4, and 5. 



5.3 Cluster Similarity Graphs 

A 3D cluster-guided projection is a continuous transformation of the data. Two
points which are close in $\mathbb{R}^p$ will remain close after projection.
However, two points which are close in a 3D projection need not be close in
$\mathbb{R}^p$. There is a loss of information in projecting high-dimensional
data to low dimensions. To mitigate this information loss somewhat, we use
cluster similarity graphs [?] as an enhancement to cluster-guided projection.

A cluster similarity graph can be defined as follows. Let the vertices be the
set of cluster centroids $\{C_j\}_{j=1}^{k}$, and add an edge between two
vertices $C_i$ and $C_j$ if $d(C_i, C_j) < t$, where t is a user-controlled
threshold. If t is very large, all cluster centroids will be connected. If t is
very small, no cluster centroids will be connected. It is thus intuitively
clear that changing the threshold value will reveal distances among cluster
centroids. The cluster similarity graph can be overlaid onto the projections.
For example, the straight lines connecting the cluster centroids in Fig. 6
represent a cluster similarity graph at a certain threshold. It can be seen
that Clusters 0, 1 and 3 are close to each other, and that Cluster 3 is close
to Cluster 4, which in turn is close to Cluster 2. Cluster 5 stands apart from
all others. The cluster similarity graph adds yet another information dimension
to cluster-guided projections, and hence enhances the viewing experience.

Fig. 6. A similarity graph adds yet another information dimension to
cluster-guided projections.
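The definition of the cluster similarity graph translates almost literally into code. This is a minimal sketch with a hypothetical function name and made-up 2D centroids. Distances are computed in the original space, not in the projection.

```python
import math
from itertools import combinations

def similarity_graph(centroids, t):
    """Edges (i, j), i < j, between centroids whose Euclidean distance is below t."""
    return [
        (i, j)
        for i, j in combinations(range(len(centroids)), 2)
        if math.dist(centroids[i], centroids[j]) < t
    ]

# Hypothetical centroids: three near each other and one far away.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
edges_small = similarity_graph(centroids, 0.5)
edges_mid = similarity_graph(centroids, 1.5)
edges_large = similarity_graph(centroids, 100.0)
```

Sweeping t from small to large adds edges in order of increasing centroid distance, which is how the threshold reveals the distance structure that the 3D projection alone may hide.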

6 Conclusion and Future Work 

This paper discussed the use of 3D projections and grand tour to visualize
higher dimensional data sets. This creates an illusion of smooth motion through
a multidimensional space. The 3D cluster-guided tour is proposed to visualize
data clusters. Each cluster-guided projection preserves the distances between
the four cluster centroids that define it, so touring through these projections
helps to capture the inter-cluster structure of complex multidimensional data.
The use of the CAVE immersive virtual environment improves the chance of
finding interesting patterns. Add-on features and interaction tools invite
viewers to interact with the data.

The cluster-guided tour is a way to use data mining as a driver for
visualization: clustering identifies homogeneous sub-populations of data, and
the sub-populations are used to help design the path of the tour. This method
can also be applied to the results generated by other data mining techniques,
for instance, to identify the significant rules produced by tree classification
and rule induction. All of these are possible ways to allow a user to better
understand both the mining results and the data at hand.

One important property of an algorithm is its scalability. Grand tour scales
well to large data sets. Its computational complexity is linear in the number
of variables. The number of variables matters only in calculating projections,
i.e. dot products, whose cost is linear in the dimensionality of their
arguments. There are two major steps in grand tour: calculating a tour path and
making projections. Calculating a tour path is independent of the total number
of data points. Making projections has a computational complexity linear in the
number of data points, in the sense that all data points have to be projected
one by one. For large data sets, the rendering cost can be greatly reduced by
drawing a density map instead of individual points.
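One simple way to build such a density map is to bin the projected points into a regular 3D grid and keep only the occupied cells. The sketch below is an assumption about how this could be done. The function name, the cell size and the sample points are hypothetical, and the source does not specify a binning scheme.

```python
from collections import Counter

def density_map(projected_points, cell_size):
    """Count projected 3D points per grid cell; only occupied cells are stored."""
    counts = Counter()
    for x, y, z in projected_points:
        cell = (int(x // cell_size), int(y // cell_size), int(z // cell_size))
        counts[cell] += 1
    return counts

points = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.4), (0.9, 0.1, 0.1), (0.1, 0.1, 0.1)]
cells = density_map(points, 0.5)
```

Binning still takes one pass over all points, but the number of objects to draw is then bounded by the number of occupied cells rather than by the number of data points. Identical records, which coincide in a scatterplot, show up as higher counts.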

The following directions are being explored or will be explored in the future:

- Working with categorical variables. In relational databases it is quite
  common for many of the variables to be categorical rather than numerical. A
  categorical variable can be mapped onto a linear scatterplot axis in the same
  way as a numeric variable, provided that some order of the distinct values of
  that variable is given along the categorical axis. Categorical values may be
  explicitly listed, in which case the order in which the values are listed is
  the order in which they are arranged on the axis. Categorical values could be
  grouped together, reflecting the natural taxonomy of values. Categorical
  values could also be sorted alphabetically, numerically by weight, or
  numerically by the aggregate value of some other variable. We are working on
  involving categorical variables in a tour.

- 3D density projection and volume rendering. The scatterplot loses its
  effectiveness as the number of points becomes very large. It also has the
  drawback that identical data records coincide with each other. As a tradeoff
  between computational complexity, comprehensibility and accuracy, we plan to
  use dynamic projections of a high dimensional density map as a model to
  visualize data sets which contain a large number of data points. 3D density
  projection is important to study, especially when clusters are not balanced
  in size and when clusters overlap with each other. Research is under way on
  problems such as how to store the sparse, voxelized high dimensional data
  more efficiently, and how to quickly render a volume of high dimensional
  voxels onto the projected 3-dimensional space.

- Parallel implementation for better performance. A parallel implementation
  is necessary for the rendering of very large data sets. Since data points are
  projected independently, it should be quite straightforward to parallelize
  the code, for instance by using multiple threads on a shared memory machine.
  Since our CAVE’s backend computer, the SGI Onyx2, is quite busy with the CAVE
  display, leaving few resources for projection calculation, a client-server
  implementation is also necessary. This will be done through a high speed
  network connection to a more powerful SGI Origin2000. All projection data
  will be calculated on the server and sent in real time to the CAVE. One
  interesting issue here is how to transfer only the necessary projected data
  to the CAVE so that the transferred data can be rendered directly.
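The first direction in the list above, placing categorical values on an ordered axis, can be sketched as follows. This is an illustration only: the function name `ordinal_axis` and the sample values are hypothetical, and alphabetical sorting is used as the fallback when no explicit order is listed.

```python
def ordinal_axis(values, order=None):
    """Map categorical values to axis positions 0, 1, 2, ... in the given order.

    If no explicit order is listed, the distinct values are sorted alphabetically.
    """
    if order is None:
        order = sorted(set(values))
    position = {v: i for i, v in enumerate(order)}
    return [position[v] for v in values]

values = ["mid", "low", "high", "low"]
listed = ordinal_axis(values, order=["low", "mid", "high"])
alphabetical = ordinal_axis(values)
```

Once mapped to positions, a categorical variable can be normalized and projected like any numeric variable, although the distances between categories are then an artifact of the chosen order.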
Abstract. In many classification problems the domains of the attributes
and the classes are linearly ordered. For such problems the classification
rule often needs to be order-preserving, or monotone as we call it.
Since the known decision tree methods generate non-monotone trees,
these methods are not suitable for monotone classification problems. We
provide an order-preserving tree-generation algorithm for multi-attribute
classification problems with k linearly ordered classes, and an algorithm
for repairing non-monotone decision trees. The performance of these
algorithms is tested on random monotone datasets.



1 Introduction 

Ordinal classification refers to an important category of real-world problems,
in which the attributes of the objects to be classified and the classes are
ordered. For this class of problems classification rules often need to be
order-preserving. In that case we have a monotone classification problem. In
this paper we study the problem of generating decision tree classifiers for
monotone classification problems: the attributes and the set of classes are
linearly ordered. Ordinal classification for multi-attribute decision making
has been studied recently by Ben-David [?] for discrete domains, and by Makino
et al. [?] for the two-class problem with continuous attributes. However,
although the tree-generation method of Ben-David accounts for the ordering of
the attributes and of the classes, order preservation is not guaranteed.
Furthermore, the method of Makino et al. is restricted to the two-class
problem. In this paper we propose a tree-growing algorithm for the k-class
problem that is guaranteed to induce monotone trees. In addition, we provide an
algorithm that repairs non-monotone decision trees. These algorithms are also
studied in our PhD dissertation [?] and a technical report [?] in which we
provide several algorithms for monotone classification problems with k classes
and discrete or continuous domains. All proofs of the results of this paper can
also be found in [?].

For motivation and examples of real-world monotone classification problems
we refer to [?]. Here we give a simple example of such a problem. Suppose
a bank wants to base its loan policy on a number of features of its clients,
for instance on income, education level and criminal record. If a client is
granted a loan, it can be in one of three classes: low, intermediate and high.
So, together with the no-loan option, we have four classes. Suppose further
that the bank wants to base its loan policy on a number of creditworthiness
decisions in the past. These past decisions are given in Table 1. A client with
features at least as high as those of another client may expect to get at least
as high a loan as the other client. So, finding a loan policy compatible with
past decisions amounts to solving a monotone classification problem with the
dataset of Table 1. In this paper we only discuss the main algorithm for
discrete domains. In a companion paper [7] on quasi-monotone decision trees we
also discuss continuous domains and the problem of dealing with noise.

Table 1. The bank loan dataset

client  income   education     crim. record  loan
cl1     low      low           fair          no
cl2     low      low           excellent     low
cl3     average  intermediate  excellent     intermediate
cl4     high     low           excellent     intermediate
cl5     high     intermediate  excellent     high

2 Monotone Classification 

In this paper we will assume that our input space $\mathcal{X}$ is a coordinate
space. Elements of $\mathcal{X}$ will be vectors $(x_1, \ldots, x_n)$ with
coordinates $x_i$ which take their values from a finite linearly ordered space
$\mathcal{X}_i$, $i = 1, \ldots, n$. Without loss of generality we may assume
that for $1 \le i \le n$, $\mathcal{X}_i = \{0, 1, \ldots, n_i\}$ for some
integer $n_i$. Here the order relation $\le$ on $\mathcal{X}$ is defined as
$x \le y$ iff $x_i \le y_i$ for all $i = 1, \ldots, n$. This order relation is
a partial ordering of the space $\mathcal{X}$. Of course, this includes the
very common situation that our examples are measurements on n variables
$X_1, \ldots, X_n$, where the individual measurement on variable $X_i$ yields a
value $x_i$ from an ordered set $\mathcal{X}_i$. So each of the variables may
take its values from a different set, as long as all these coordinate sets are
linearly ordered.

Next, let $C$ be a finite linearly ordered set of classes, with linear ordering
$\le$. A classification rule or class labeling is a function
$\lambda : \mathcal{X} \to C$ which assigns a class from $C$ to every point in
the input space $\mathcal{X}$. The minimal and maximal elements of $C$ will be
denoted by $c_{\min}$ and $c_{\max}$ respectively. A classification problem is
the problem of finding a class labeling $\lambda$ that satisfies certain
constraints, to be specified in the problem description. One possible
constraint is that the labeling $\lambda$ be monotone: a monotone
classification rule is a function $\lambda : \mathcal{X} \to C$ for which

$$x \le y \;\Rightarrow\; \lambda(x) \le \lambda(y) \qquad (1)$$

for all points $x, y \in \mathcal{X}$.

A very common classification problem occurs when a dataset, or set of
examples, is available. The usual constraint to be met in such a situation is
that the classification rule one is looking for should correctly classify all
examples in the dataset. We deal with this situation in the sequel.
A dataset is a finite collection of examples from the input space, together
with a class labeling of all these examples. Formally, we define a dataset as
follows:

Definition 1 A dataset $\mathcal{D}$ is a pair $(D, \lambda)$ where
$D \subseteq \mathcal{X}$ is a finite subset of the input space $\mathcal{X}$
and $\lambda : D \to C$ is a class labeling of the elements of $D$. The
elements of $D$ will be called the examples of the dataset.

Note first of all that the class labeling $\lambda$ of a dataset
$\mathcal{D} = (D, \lambda)$ is not a classification rule: it is only defined
on $D$, a subset of $\mathcal{X}$, while a classification rule must be defined
on all elements of the input space $\mathcal{X}$. Secondly, we do not allow an
example to have two or more different classes: all elements of the dataset must
be consistently labeled.

Given a dataset $\mathcal{D} = (D, \lambda)$ we can try to solve the
corresponding monotone classification problem of finding a monotone
classification rule $\lambda' : \mathcal{X} \to C$ that extends the class
labeling $\lambda$ of the dataset $\mathcal{D}$ to the entire input space
$\mathcal{X}$. Thus, $\lambda'(x) = \lambda(x)$ for all $x \in D$. Obviously,
if one wants to find a solution for such a monotone classification problem, the
dataset itself has to be monotone:

Definition 2 A dataset $\mathcal{D} = (D, \lambda)$ is called monotone if the
implication (1) holds for all $x, y \in D$.

In order to save space we will often map the values of the attributes of a
dataset to a set of numbers. For instance, Table 1 could be written as

    X1   X2   X3    C
    0    0    1     0
    0    0    2     1
    1    1    2     2
    2    0    2     2
    2    1    2     3

when we use the mapping low → 0, average → 1, high → 2 for feature
$X_1$ = income, etc. We will even write concisely
$\mathcal{D}$ = {001:0, 002:1, 112:2, 202:2, 212:3} for the above dataset.
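The concise notation maps naturally onto a dictionary from attribute vectors to class labels. The following is a minimal sketch, assuming the monotonicity condition of Definition 2 is the usual one (x ≤ y componentwise implies that the label of x does not exceed the label of y); the names `leq` and `is_monotone` are illustrative and do not come from the paper.

```python
from itertools import combinations

# The dataset above in concise form: (x1, x2, x3) -> class label.
dataset = {
    (0, 0, 1): 0,
    (0, 0, 2): 1,
    (1, 1, 2): 2,
    (2, 0, 2): 2,
    (2, 1, 2): 3,
}


def leq(x, y):
    """Componentwise partial order on attribute vectors."""
    return all(u <= v for u, v in zip(x, y))


def is_monotone(data):
    """Check that x <= y implies label(x) <= label(y) for all pairs of examples."""
    for x, y in combinations(data, 2):
        if leq(x, y) and data[x] > data[y]:
            return False
        if leq(y, x) and data[y] > data[x]:
            return False
    return True
```

Pairs that are incomparable under the componentwise order, such as 112 and 202, impose no constraint at all.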

As noted above, the problem of finding a solution to a monotone classification
problem amounts to finding a monotone extension $\lambda'$ of the class
labeling $\lambda$ of a dataset $\mathcal{D} = (D, \lambda)$. Formally, a
function $\lambda' : \mathcal{X} \to C$ is an extension of
$\lambda : D \to C$ if the restriction of $\lambda'$ to $D$ equals $\lambda$,
i.e. $\lambda'|_D = \lambda$, or equivalently, if $\lambda'(x) = \lambda(x)$
for all $x \in D$. If $\mathcal{D} = (D, \lambda)$ is monotone, we denote the
collection of all monotone extensions of $\lambda$ by $\Lambda(\mathcal{D})$.
Note that for classification rules
$\lambda_1, \lambda_2 \in \Lambda(\mathcal{D})$ we mean by
$\lambda_1 \le \lambda_2$ that $\lambda_1(x) \le \lambda_2(x)$ for all
$x \in \mathcal{X}$. $\Lambda(\mathcal{D})$ is partially ordered by this order
relation $\le$. We will now define two special elements of this collection
$\Lambda(\mathcal{D})$.
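Definition 3 If $\mathcal{D} = (D, \lambda)$ is a monotone dataset, we define
$\lambda^{\mathcal{D}}_{\min} : \mathcal{X} \to C$ and
$\lambda^{\mathcal{D}}_{\max} : \mathcal{X} \to C$ as follows: for all
$x \in \mathcal{X}$,

$$\lambda^{\mathcal{D}}_{\min}(x) =
\begin{cases}
\max\{\lambda(y) : y \in D,\ y \le x\} & \text{if } x \ge y \text{ for some } y \in D \\
c_{\min} & \text{otherwise}
\end{cases}$$

and

$$\lambda^{\mathcal{D}}_{\max}(x) =
\begin{cases}
\min\{\lambda(y) : y \in D,\ y \ge x\} & \text{if } x \le y \text{ for some } y \in D \\
c_{\max} & \text{otherwise.}
\end{cases}$$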



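The next lemma shows that the functions $\lambda^{\mathcal{D}}_{\min}$ and
$\lambda^{\mathcal{D}}_{\max}$ just defined are the minimal and maximal
elements of $\Lambda(\mathcal{D})$, respectively.

Lemma If $\mathcal{D} = (D, \lambda)$ is a monotone dataset, the following
statements hold for the functions $\lambda^{\mathcal{D}}_{\min}$ and
$\lambda^{\mathcal{D}}_{\max}$:

(i) $\lambda^{\mathcal{D}}_{\min}, \lambda^{\mathcal{D}}_{\max} \in \Lambda(\mathcal{D})$;

(ii) $\Lambda(\mathcal{D}) = \{\lambda' : \lambda^{\mathcal{D}}_{\min} \le \lambda' \le \lambda^{\mathcal{D}}_{\max} \text{ and } \lambda' \text{ monotone}\}$.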



Theoretically, we now have at least two solutions for a monotone
classification problem with dataset $\mathcal{D} = (D, \lambda)$: the minimal
and the maximal extension of $\lambda$. We will call these two classification
rules the minimal rule and the maximal rule, respectively. In addition, for
every point $x$ in the input space we have bounds that any rule $\lambda'$
must satisfy:

$$\lambda^{\mathcal{D}}_{\min}(x) \le \lambda'(x) \le \lambda^{\mathcal{D}}_{\max}(x).$$

Any monotone classification rule that satisfies these bounds will be another
solution to our problem.
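The two extreme extensions of Definition 3 are easy to evaluate pointwise. The sketch below assumes integer class labels and attribute vectors stored as tuples; the function names and the parameters `c_min` and `c_max` (the smallest and largest class) are illustrative choices, not notation from the paper.

```python
from itertools import product


def leq(x, y):
    """Componentwise partial order on attribute vectors."""
    return all(u <= v for u, v in zip(x, y))


def lambda_min(data, x, c_min):
    """Minimal extension: largest label among the examples below x, else c_min."""
    below = [label for y, label in data.items() if leq(y, x)]
    return max(below) if below else c_min


def lambda_max(data, x, c_max):
    """Maximal extension: smallest label among the examples above x, else c_max."""
    above = [label for y, label in data.items() if leq(x, y)]
    return min(above) if above else c_max


bank = {(0, 0, 1): 0, (0, 0, 2): 1, (1, 1, 2): 2, (2, 0, 2): 2, (2, 1, 2): 3}

# Lower and upper bound for every point of the 3 x 3 x 3 input space.
bounds = {x: (lambda_min(bank, x, 0), lambda_max(bank, x, 3))
          for x in product(range(3), repeat=3)}
```

For a monotone dataset the lower bound never exceeds the upper bound, and the two coincide on the examples themselves.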

3 Induction of Monotone Decision Trees 

From now on we will require the representation of our classification rule to
have a specific form, viz. the form of a classification tree or decision tree.
In this paper we will only consider univariate binary decision trees. However,
we do consider non-binary trees in |B| . For univariate binary trees, at each
node a split is made using a test of the form $X_i \le c$ for some
$c \in \mathcal{X}_i$, $1 \le i \le n$. Thus, in each node the associated set
$T \subseteq \mathcal{X}$ is split into the two subsets
$T_L = \{x \in T : x_i \le c\}$ and $T_R = \{x \in T : x_i > c\}$.

It is easily shown that each subset $T$ associated with a node or leaf can be
written in the form $T = \{x \in \mathcal{X} : a \le x \le b\}$ for some
$a, b \in \mathcal{X}$. We shall use the notation $T = [a, b]$ for a subset of
this form.

We shall now show how we can generate a binary decision tree $\mathcal{T}$
from a dataset $\mathcal{D}$. This process is also called inducing a binary
decision tree $\mathcal{T}$ from a dataset $\mathcal{D}$. An algorithm for the
induction of a decision tree $\mathcal{T}$ from a dataset $\mathcal{D}$
contains the following ingredients:

- a splitting rule $\mathcal{S}$: defines the way to generate a split in each
  node,
- a stopping rule $\mathcal{H}$: determines when to stop splitting and form a
  leaf,
- a labeling rule $\mathcal{L}$: assigns a class label to a leaf when it is
  decided to create one.

If $\mathcal{S}$, $\mathcal{H}$ and $\mathcal{L}$ have been specified, then an
induction algorithm according to these rules can be recursively described as
in Figure 1.

    tree(X, D0):
        split(X, D0)

    split(T, var D):
        D := update(D, T);
        if H(T, D) then
            assign class label L(T, D) to leaf T
        else
        begin
            (TL, TR) := S(T, D);
            split(TL, D);
            split(TR, D)
        end

Fig. 1. Monotone Tree Induction Algorithm

In this algorithm outline there is one aspect that we have not mentioned yet:
the update rule. In the algorithm we use, we shall allow the dataset to be
updated at various moments during tree generation. During this process of
updating we will incorporate in the dataset knowledge that is needed to
guarantee the monotonicity of the resulting tree.

Note that $\mathcal{D}$ must be passed to the split procedure as a variable
parameter, since $\mathcal{D}$ is updated during execution of the procedure.

As noted in the beginning of this section, we only need to specify a splitting
rule, a stopping rule, a labeling rule and an update rule. Together these are
then plugged into the algorithm of Figure 2 to give a complete description of
the algorithm under consideration. Note that each node $T$ to be split or to
be made into a leaf has the form $T = [a, b]$ for some $a, b \in \mathcal{X}$.

We start by describing the update rule. When this rule fires, the dataset
$\mathcal{D} = (D, \lambda)$ will be updated. In our algorithm at most two
elements will be added to the dataset each time the update rule fires. Recall
that, because $T$ is of the form $T = [a, b]$, $a$ is the minimal element of
$T$ and $b$ is the maximal element of $T$. Now, either $a$ or $b$, or both,
will be added to $D$, provided with a well-chosen labeling. If $a$ and $b$
both already belong to $D$, nothing changes. The complete update rule is
displayed in Figure 2.

The splitting rule $\mathcal{S}(T, \mathcal{D})$ must be such that at each
node the associated subset $T$ is split into two nonempty subsets

$$\mathcal{S}(T, \mathcal{D}) = (T_L, T_R) \text{ with } T_L = \{x \in T : x_i \le c\} \text{ and } T_R = \{x \in T : x_i > c\} \qquad (2)$$
    update(var D, T):
        if a ∉ D then
        begin
            D := D ∪ {a};
            λ(a) := λmax(a)
        end;
        if b ∉ D then
        begin
            D := D ∪ {b};
            λ(b) := λmin(b)
        end

Fig. 2. The Update Rule of the Standard Algorithm



for some $i \in \{1, \ldots, n\}$ and some $c \in \mathcal{X}_i$. Note that,
because of the assumption in Section 2, $T_R$ can also be written as
$T_R = \{x \in T : x_i \ge c'\}$ for some $c' \in \mathcal{X}_i$. Furthermore,
the splitting rule must satisfy the following requirement: $i$ and $c$ must be
chosen such that

$$\exists\, x, y \in D \cap T \text{ with } \lambda(x) \ne \lambda(y),\ x \in T_L \text{ and } y \in T_R. \qquad (3)$$



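Next, we consider the stopping rule $\mathcal{H}(T, \mathcal{D})$. As a result
of the actions of the update rule, both the minimal element $a$ and the
maximal element $b$ of $T$ belong to $D$. Now, as a stopping rule we will use:

$$\mathcal{H}(T, \mathcal{D}) =
\begin{cases}
\text{true} & \text{if } \lambda(a) = \lambda(b), \\
\text{false} & \text{otherwise.}
\end{cases} \qquad (4)$$

Finally, the labeling rule $\mathcal{L}(T, \mathcal{D})$ will be simply:

$$\mathcal{L}(T, \mathcal{D}) = \lambda(a) = \lambda(b). \qquad (5)$$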



Now we can formulate the main theorem of this paper.

Theorem If $\mathcal{D} = (D, \lambda)$ is a monotone dataset on input space
$\mathcal{X}$ and if the functions $\mathcal{S}$, $\mathcal{H}$, $\mathcal{L}$
satisfy (2), (3), (4) and (5), then the algorithm specified in Figures 1 and 2
will generate a monotone decision tree $\mathcal{T}$ with
$\lambda_{\mathcal{T}} \in \Lambda(\mathcal{D})$.



Note that this theorem actually proves a whole class of algorithms to be
correct: the requirements that the theorem places on the splitting rule are
quite general. Nothing is said in the requirements about how to select the
attribute $X_i$ and how to calculate the cut-off point $c$ for a test of the
form $t = \{X_i \le c\}$.
Obvious candidates for attribute-selection and cut-off point calculation are the 
well-known impurity measures like entropy, Gini or the twoing rule, see 0. 
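To make the interplay of the four rules concrete, here is a minimal sketch of the Standard Algorithm with entropy as the impurity measure. It assumes attribute values coded as consecutive integers, class labels coded as integers, and ties between equally good splits broken by taking the first one found; the function names and the tuple representation of the tree are illustrative and not taken from the paper.

```python
from itertools import product
from math import log2


def leq(x, y):
    """Componentwise order on attribute vectors."""
    return all(u <= v for u, v in zip(x, y))


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))


def update(a, b, data, c_min, c_max):
    """Update rule: add the corners a and b of the node T = [a, b] to the dataset."""
    if a not in data:  # a gets the value of the maximal extension
        data[a] = min([c for y, c in data.items() if leq(a, y)] or [c_max])
    if b not in data:  # b gets the value of the minimal extension
        data[b] = max([c for y, c in data.items() if leq(y, b)] or [c_min])


def split(a, b, data, c_min, c_max):
    """Return a leaf label, or (i, c, left, right) for the test x_i <= c."""
    update(a, b, data, c_min, c_max)
    if data[a] == data[b]:   # stopping rule (4)
        return data[a]       # labeling rule (5)
    inside = [(x, lab) for x, lab in data.items() if leq(a, x) and leq(x, b)]
    best = None
    for i in range(len(a)):
        for c in range(a[i], b[i]):
            left = [lab for x, lab in inside if x[i] <= c]
            right = [lab for x, lab in inside if x[i] > c]
            # Requirement (3): differently labeled examples on the two sides.
            if not left or not right or len(set(left) | set(right)) < 2:
                continue
            score = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(inside)
            if best is None or score < best[0]:
                best = (score, i, c)
    _, i, c = best
    b_left = b[:i] + (c,) + b[i + 1:]        # maximal element of the left node
    a_right = a[:i] + (c + 1,) + a[i + 1:]   # minimal element of the right node
    return (i, c,
            split(a, b_left, data, c_min, c_max),
            split(a_right, b, data, c_min, c_max))


def classify(tree, x):
    """Follow the tests from the root down to a leaf label."""
    while isinstance(tree, tuple):
        i, c, left, right = tree
        tree = left if x[i] <= c else right
    return tree


bank = {(0, 0, 1): 0, (0, 0, 2): 1, (1, 1, 2): 2, (2, 0, 2): 2, (2, 1, 2): 3}
tree = split((0, 0, 0), (2, 2, 2), dict(bank), c_min=0, c_max=3)
```

The dataset is passed as a mutable dictionary so that corner elements added in one branch are visible in all later branches, which mirrors the `var` parameter of Figure 1.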

A useful variation of the above algorithm is the following. We change the 
update rule to 
    update(var D, T):
        if T is homogeneous then
        begin
            body of update procedure of Figure 2
        end

Fig. 3. Update Rule of the Repairing Algorithm

Fig. 4. Non-monotone Decision Tree



thus only adding the corner elements to the dataset if the node $T$ is
homogeneous, i.e. if $\forall x, y \in D \cap T : \lambda(x) = \lambda(y)$. If
$T$ is homogeneous, we will use the notation $\lambda_T$ for the common value
$\lambda(x)$ of all $x \in D \cap T$. The stopping rule becomes:
$\mathcal{H}(T, \mathcal{D})$ = true if $T$ is homogeneous and
$\lambda(a) = \lambda(b)$, and false otherwise; and the labeling rule:
$\mathcal{L}(T, \mathcal{D}) = \lambda(a) = \lambda(b) = \lambda_T$. With
these changes the theorem remains true, as can easily be seen. However,
whereas with the standard algorithm one works at 'monotonizing' the tree from
the beginning, this algorithm starts adding corner elements only when it has
found a homogeneous node. For instance, if one uses maximal decrease of
entropy as a measure of the performance of a test-split
$t = \{X_i \le c\}$, this new algorithm is equal to Quinlan's C4.5 algorithm
until one hits upon a homogeneous node; from then on our algorithm starts
adding the corner elements $a$ and $b$ to the dataset, enlarging the tree
somewhat, but making it monotone. We call this process cornering. Thus, this
algorithm can be seen as a method that first builds a traditional
(non-monotone) tree with a method such as C4.5 or CART, and next makes it
monotone by adding corner elements to the dataset. This observation also
suggests a possible use of this variant: if one has an arbitrary
(non-monotone) tree for a monotone classification problem, it can be
'repaired', i.e. made monotone, by adding corner elements to the leaves and
growing some more branches where necessary.

As an example of the use of this repairing algorithm, suppose we have the
following monotone dataset $\mathcal{D}$ = {000:0, 001:1, 100:0, 110:1}.
Suppose further that someone hands us the decision tree of Figure 4 for
classifying this dataset. This tree indeed classifies $\mathcal{D}$ correctly,
but although $\mathcal{D}$ is monotone, the tree is not. In fact, it
classifies data element 001 as belonging to class 1 and 101 as 0. Clearly,
this conflicts with monotonicity rule (P). To correct the above tree, we apply
the algorithm of Figure 3 to it. We add the maximal element of the third leaf,
101, to the dataset with the value $\lambda^{\mathcal{D}}_{\min}(101) = 1$.
The leaf is subsequently split and the resulting tree is easily found to be
monotone, see Figure 5. Of course, if we had grown a tree directly for the
above dataset $\mathcal{D}$ with the Standard Algorithm, we would have ended
up with a very small tree with only three leaves.

Fig. 5. The above tree, but repaired



4 Example 

In this section we will use the presented Standard Algorithm to generate a
monotone decision tree for the dataset of Table 1. As an impurity criterion we
will use entropy, see (2|. Starting in the root, we have $T = \mathcal{X}$, so
$a = 000$ and $b = 222$. Now, $\lambda^{\mathcal{D}}_{\max}(000) = 0$ and
$\lambda^{\mathcal{D}}_{\min}(222) = 3$, so the elements 000:0 and 222:3 are
added to the dataset, which then consists of 7 examples.

Next, six possible splits are considered: $X_1 \le 0$, $X_1 \le 1$,
$X_2 \le 0$, $X_2 \le 1$, $X_3 \le 0$ and $X_3 \le 1$. For each of these
possible splits we calculate the decrease in entropy as follows. For the test
$X_1 \le 0$, the space $\mathcal{X} = [000, 222]$ is split into the subsets
$T_L = [000, 022]$ and $T_R = [100, 222]$. Since $T_L$ contains three data
elements and $T_R$ contains the remaining four, the average entropy of the
split is $\frac{3}{7} \times 0.92 + \frac{4}{7} \times 1 = 0.97$. Thus, the
decrease in entropy for this split is $1.92 - 0.97 = 0.95$. When calculated
for all six splits, the split $X_1 \le 0$ gives the largest decrease in
entropy, so it is used as the first split in the tree.

Proceeding with the left node $T = [000, 022]$ we start by calculating
$\lambda^{\mathcal{D}}_{\min}(022) = 1$ and adding the element 022:1 to the
dataset $\mathcal{D}$, which will then have eight elements. We then consider
the four possible splits $X_2 \le 0$, $X_2 \le 1$, $X_3 \le 0$ and
$X_3 \le 1$, of which the last one gives the largest decrease in entropy and
leads to the nodes $T_L = [000, 021]$ and $T_R = [002, 022]$. Since
$\lambda^{\mathcal{D}}_{\min}(021) = 0 = \lambda(000)$, $T_L$ is made into a
leaf with class 0.
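The average entropy of a candidate split can be recomputed in a few lines. This is a small sketch for the test on the first attribute only; the label lists are read off the seven-element dataset described above, and the helper name `entropy` is an illustrative choice.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))


left = [0, 0, 1]       # labels of 000, 001, 002 (first attribute equal to 0)
right = [2, 2, 3, 3]   # labels of 112, 202, 212, 222 (first attribute above 0)

n = len(left) + len(right)
average = len(left) / n * entropy(left) + len(right) / n * entropy(right)
```

The left part has entropy of about 0.92 and the right part exactly 1, so `average` comes out at roughly 0.965, in line with the value of 0.97 quoted above when the left entropy is first rounded to two decimals.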
Fig. 6. Monotone Decision Tree for the Bank Loan Dataset

Proceeding in this manner we end up with the decision tree of Figure 6, which
is easily checked to be monotone.

5 Experiments 

We did some experiments to get an idea of the usefulness of our methods and
to compare them with those of Ben-David jll‘il,'S] and Makino et al. 0. First
we did some experiments to investigate the size of the trees that our methods
would generate, also in comparison with other methods. We generated random
monotone datasets with 10, 20, 30, etc. examples and built trees with each of
those datasets, using three different methods: C4.5 as a general method, which
does not generate monotone trees, and two methods presented in this paper:
MT1 is the repairing method of Figure 3, and MT2 is the main method of
Section 3, which we called the Standard Algorithm. As an aside, we use the
abbreviation MT for Monotone Tree. In all experiments we used entropy as the
impurity



Table 2. Size of trees: Number of Leaves

    examples    C4.5    MT1     MT2
    10           7.0    14.4     7.8
    20          12.2    30.2    18.2
    30          17.4    42.6    30.2
    40          21.8    42.2    36.0
    50          30.2    53.2    43.2
    60          30.8    54.2    44.2
    70          38.6    59.6    50.8
    80          43.4    69.0    63.6
    90          47.8    69.6    63.8
    100         56.2    78.0    66.2
    150         78.2    92.8    89.2
Table 3. Size of Trees: Average Path Length (left) and Expected Number of
Comparisons Needed (right)

    examples   C4.5   MT1   MT2   |   examples   C4.5   OLM    MT1   MT2
    10         3.2    4.5   3.3   |   10         2.6     6.8   3.1   2.7
    20         4.0    5.7   4.8   |   20         3.5    11.7   4.5   4.1
    30         4.6    6.0   5.5   |   30         4.0    15.2   4.9   4.7
    40         4.9    6.0   6.0   |   40         4.3    18.2   4.9   4.7
    50         5.5    6.4   6.1   |   50         4.9    21.3   5.5   5.3
    60         5.6    6.7   6.4   |   60         4.8    24.4   5.4   5.2
    70         6.1    6.8   6.4   |   70         5.1    26.0   5.6   5.5
    80         6.1    6.8   6.7   |   80         5.4    27.2   5.9   5.8
    90         6.2    6.9   6.7   |   90         5.5    28.3   5.9   5.8
    100        6.4    6.9   6.7   |   100        5.7    33.4   6.1   5.9
    150        6.9    7.2   7.2   |   150        6.2    33.9   6.4   6.4

measure. For each number of examples we generated five different datasets,
each from a universe with 5 attributes, each having 3 possible values, while
all data elements were evenly divided over 4 classes. The results for the
number of leaves of the generated trees are shown in Table 2. The figures
shown are averages over the five datasets.

The size of a tree can also be measured by looking at the depth of a tree.
One way to measure this depth is the average path length: the average length
of a path from the root of the tree to a leaf. For instance, the average path
length of the tree of Figure 6 is 2.4, since there are three paths of length 2
and two of length 3. Table 3 (left) shows the results of our measurements,
where the size of the generated trees is measured in average path length.
Another measure of the depth of a tree is the expected number of comparisons
needed to classify an arbitrary new example presented to the tree. If
$T_1, \ldots, T_k$ are the leaves of a tree, this measure can be calculated as

$$\text{Expected Number of Comparisons Needed} = \sum_{i=1}^{k} \frac{|T_i|}{|\mathcal{X}|}\, \ell_i$$

where $\ell_i$ is the length of the path from the root to the leaf $T_i$. One
advantage of this method of measuring the size of a tree is that it can also
be applied to a non-tree method such as OLM Q, where a new example also must
be compared with a number of elements of the OLM-database. Thus, this last
measure is also a measure of the efficiency of a classifier at classifying new
examples. The results are shown in Table 3 (right).
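Both depth measures are straightforward to compute once the leaves of a tree are known. The sketch below assumes that new examples are drawn uniformly from the input space, so that a leaf is reached with probability proportional to the number of points it covers; the leaf sizes in the usage lines are hypothetical values for a five-leaf tree over a 27-point input space, chosen only for illustration.

```python
def average_path_length(path_lengths):
    """Mean root-to-leaf path length, every leaf counted once."""
    return sum(path_lengths) / len(path_lengths)


def expected_comparisons(leaves):
    """Size-weighted mean path length.

    `leaves` is a list of (leaf_size, path_length) pairs, where leaf_size is
    the number of points of the input space that fall into the leaf.
    """
    total = sum(size for size, _ in leaves)
    return sum(size * length for size, length in leaves) / total


# Three paths of length 2 and two of length 3, as in the example above.
apl = average_path_length([2, 2, 2, 3, 3])

# Hypothetical leaf sizes (they sum to 27) paired with the same path lengths.
enc = expected_comparisons([(6, 2), (3, 2), (9, 2), (3, 3), (6, 3)])
```

The second measure weights deep leaves by how often they are actually reached, which is why it can differ from the plain average path length.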

Thus, although both OLM and our decision tree methods MT1 and MT2 produce
genuinely monotone classification rules, the decision tree methods appear to
be much more efficient in classifying new examples. Of course, for OLM, the
initial production of a classifier costs only a small fraction of the time it
costs to build a new MT1 or MT2 tree, since OLM is only slightly more than a
case-based system. However, when the classifiers are actually used, the
situation is reversed, and our methods are superior. As a second experiment we
did an
Table 4. Percentage Correctly Classified in 3-fold Cross Validation experiments

    examples    C4.5    OLM     MT1     MT2
    10          37.2    38.9    51.1    51.1
    20          22.9    34.6    33.5    35.2
    30          35.0    54.0    46.7    44.7
    40          51.0    53.3    54.8    53.0
    50          32.0    46.7    48.4    48.8
    60          48.0    50.0    56.0    57.0
    70          48.4    55.4    56.5    56.0
    80          35.9    49.9    51.7    48.7
    90          55.4    56.7    62.7    61.8
    100         48.4    56.0    61.4    59.0
    Average     41.5    49.6    52.3    51.5


Table 5. Comparison with Makino et al.

                  # leaves           average depth          speed
    examples   Mak    MT1    MT2     Mak   MT1   MT2     Mak    MT1    MT2
    10          5.8   12.2    8.0    3.0   4.1   3.2      0.4   1.0     0.6
    20          7.0   14.6   10.6    3.2   4.4   4.0      0.8   1.2     0.8
    30         13.0   19.2   15.2    4.2   5.1   4.5      1.8   1.2     1.8
    40         15.6   21.0   18.8    4.8   5.4   5.1      2.4   2.0     2.4
    50         13.6   18.8   17.6    4.3   4.9   4.8      2.4   1.6     2.4
    60         19.4   23.0   20.8    5.1   5.4   5.0      3.6   2.0     3.2
    70         22.4   26.0   22.2    5.3   5.6   5.2      4.6   2.4     4.8
    80         25.6   31.6   30.8    5.6   6.0   5.6      5.6   3.4     6.8
    90         26.8   32.4   30.2    5.7   5.7   5.6      5.8   3.6     6.8
    100        31.8   32.8   31.6    5.9   5.8   5.7      7.6   4.4     8.0
    150        44.8   46.2   44.2    6.3   6.2   6.1     13.8   6.4    17.6


attempt to investigate the generalizing power of the proposed methods. Again,
we generated random monotone datasets of size 10, 20, etc., but now we used
these datasets for 3-fold cross validation experiments. Each complete cross
validation experiment was repeated four times with a different dataset. Thus,
for each size and each method, we generated twelve different classifiers. The
average percentage of correctly classified examples can be found in Table 4
for each of the five methods we tested. As a tentative result, it seems that
our methods of Section 4 are slightly better at predicting a class for a new
example than the other methods for these monotone problems.

As a third and final experiment we wanted to compare our main methods with
those of Makino et al. To do this we could only consider two-class problems,
since their method works only in that situation. Thus, we generated monotone
datasets for two-class problems with size 10, 20, etc., we generated trees
with the method of Makino et al. and with our methods MT1 and MT2, and we
measured the size of the resulting trees with the above three criteria. In
addition, we measured the speed of the algorithm for generating the trees in
seconds on our computer. The results are shown in Table 5. It appears that our
algorithms MT1 and MT2 in the 2-class situation generate trees of comparable
size, but our method MT1 seems to be faster than the method of Makino et al.
and MT2.

6 Conclusion and Further Remarks 

We have provided a tree generation algorithm for monotone classification
problems with discrete domains and k classes. This improves and extends
results of Ben-David and Makino et al. |5I. This algorithm is to our knowledge
the only method that guarantees to produce monotone decision trees for the
k-class problem. In addition, we show that our algorithm can be used to repair
non-monotone decision trees that have been generated by other methods. We also
discuss a number of experiments in order to test the performance of our
algorithm for the k-class problem and to compare it with other methods. Our
methods turn out to be much more efficient at classifying new examples than
the only known existing method for monotone classification (OLM). The accuracy
of our methods is at least of the same order. In the special case of the
two-class problem it appears that the results of our algorithm (speed and tree
size) are comparable with those of |S|. For real world monotone classification
problems it would also be interesting to generate trees with different degrees
of monotonicity, not only in case the data set is not monotone due to noise,
but also in case the data set is monotone. In our companion paper 0 on
quasi-monotone decision trees we relax the requirement of full monotonicity,
thereby giving an improvement of the results w.r.t. tree size, speed and
generalisation. In that paper we also deal with the problem of noise.
Abstract. This paper introduces a Bayesian method for clustering dynamic
processes and applies it to the characterization of the dynamics of a military
scenario. The method models dynamics as Markov chains and then applies an
agglomerative clustering procedure to discover the most probable set of
clusters capturing the different dynamics. To increase efficiency, the method
uses an entropy-based heuristic search strategy.



1 Introduction 

An open problem in exploratory data analysis is to automatically construct 
explanations of data m- This paper takes a step toward automatic explanations 
of time series data. In particular, we show how to reduce a large batch of time 
series to a small number of clusters, where each cluster contains time series 
that have similar dynamics, thus simplifying the task of explaining the data. 
The method we propose in this paper is a Bayesian algorithm for clustering by 
dynamics. 

Suppose one has a set of univariate time series generated by one or more 
unknown processes, and the processes have characteristic dynamics. Clustering 
by dynamics is the problem of grouping time series into clusters so that the 
elements of each cluster have similar dynamics. For example, if a batch contains
a time series of systolic and diastolic phases, clustering by dynamics might find
clusters corresponding to the pathologies of the heart. If the batch of time series
represents sensory experiences of a mobile robot, clustering by dynamics might 
find clusters corresponding to abstractions of sensory inputs mg. 

Our algorithm learns Markov chain (MC) representations of the dynamics in
the time series and then clusters similar time series to learn prototype dynamics. 
A MC represents a dynamic process as a transition probability matrix. For each 
time series observed on a variable X, we construct one such matrix. Each row in 
the matrix represents a state of the variable X, and the columns represent the 
probabilities of transition from that state to each other state of the variable on 
the next time step. The result is a set of conditional probability distributions, 
one for each state of the variable X, that can be learned from a time series. A 
transition matrix is learned for each time series in a training batch of time series. 
Next, a Bayesian clustering algorithm groups time series that produce similar 
transition probability matrices. 
The main feature of our Bayesian clustering method is to regard the choice
of clusters as a problem of Bayesian model selection and, by taking advantage of 
known results on Bayesian modeling of discrete variables |2], we provide closed 
form solutions for the evaluation of the likelihood of a given set of clusters and 
a heuristic entropy-based search. We note that recent work by Einiini has 
investigated modeling approaches to clustering dynamic process. 

While there are similarities between clustering by dynamics and learning 
Hidden Markov Models (HMMs), the former problem is different and somewhat 
simpler. An HMM has one probability distribution for the symbols emitted by 
each state, and also a matrix of probabilities of transitions between states E]g|. 
In our problem we fit a fully observable Markov model to each episode and 
then we search for the partition of these models into clusters that has maximum 
probability. In fact, in a related project, we developed an HMM approach to pro- 
cessing robot sensor data [^. When trained on a batch of multivariate time series 
of sensor data, our HMM method learns a machine that generally has several 
paths from initial to ending states. Each training series is modeled as a sequence 
of state transitions along one of these paths, and so series that follow the same 
state transition paths might be viewed as members of a cluster. But the cluster- 
ing is not based on overt similarity judgments, nor does the clustering satisfy any 
helpful properties, whereas clusters given by the technique in this paper consti- 
tute the maximum likelihood partition of sensor time series. In related work on 
clustering by dynamics, Oates has developed a method based on Dynamic Time 
Warping jSj. In this work, the “stretch” required to warp one multivariate time 
series into another is used as a similarity metric for agglomerative clustering. 
Because Dynamic Time Warping works on an entire time series, it is a good 
metric for comparing the shape of two series. The algorithm we discuss in this 
paper assumes the series are Markov chains, so clustering is based on the simi- 
larity of transition probability matrices, and some information about the shape 
of the series is lost. While there are undoubtedly applications where the shape 
of a time series is its most important feature, we find that Bayesian clustering of 
MCs produces meaningful groupings of time series, even in applications where the 
Markov assumption is not known to hold (see Section . Our algorithm is also 
very efficient and accurate, and it provides a way to include prior information 
about clusters and a heuristic search for the maximum likelihood clustering. 

The remainder of this paper is organized as follows. We first describe the
scenario to which we apply our clustering algorithm. The Bayesian clustering
algorithm is described in Section 3. We then apply the algorithm to a set of
81 time series generated in our application scenario and discuss the results
in the section that follows.



2 The Problem 

The domain of our application is a simulated military scenario. For this work,
we employ the Abstract Force Simulator (AFS) [Q, which has been under
development at the University of Massachusetts for several years. AFS uses a
set of abstract agents called blobs which are described by a small set of
physical features, including mass and velocity. A blob is an abstract unit; it
could be
an army, a soldier, a planet, or a political entity. Every blob has a small set of 
primitive actions that it can perform, primarily move and apply-force, to which 
more advanced actions, such as tactics in the military domain, can be added. 
AFS operates by iterating over all the units in a simulation at each clock tick and 
updates their properties and locations based on the forces acting on them. The 
physics of the world specifies probabilistically the outcomes of unit interactions. 
By changing the physics of the simulator, a military domain was created for this 
work. 



Table 1. The tasks given to each blob in the scenario.

                         Blob      Task
    Primary Effort       Red 2     retain objective Red Flag
                         Blue 2    attack objective Red Flag
    Supporting Effort    Red 1     attack blob Blue 1
                         Blue 1    escort blob Blue 1



The time series that we want to analyze come from a simple 2-on-2 Capture the
Flag scenario. In this scenario, the blue team, Blue 1 and Blue 2, attempts to
capture the objective Red Flag. Defending the objective is the red team, Red 1
and Red 2. The red team must defend the objective for 125 time steps. If the
objective has not been captured by the 125th time step, the trial is ended and
the red team is awarded a victory. The choice of goals and the number of blobs
on each team provide a simple scenario. Each blob is given a task (or tactic)
to follow and it will attempt to fulfill the task until it is destroyed or the
simulation ends (Table 1).

In this domain, retaining requires the blob to maintain a position near the 
object of the retain — the Red Flag in this example — and protect it from 
the enemy team. When an enemy blob comes within a certain proximity of the 
object of the retain, the retaining blob will attack it. Escorting requires the blob 
to maintain a position close to the escorted blob and to attack any enemy blob 
that comes within a certain proximity of the escorted blob. Attacking requires 
the blob to engage the object of the attack without regard to its own state. These 
tactics remain constant over all trials, but vary in the way they are carried out 
based on environmental conditions such as mass, velocity and distance of friendly 
and enemy units. To add further variety to the trials, there are three initial mass 
values that a blob can be given. With four blobs, there are 81 combinations of 
Table 2. Univariate representation of the scenario.

    State #    State Description     Notes
    0          (F1, FFR+, CFR+)      Strong Red
    1          (F1, FFR+, CFR-)
    2          (F1, FFR-, CFR+)
    3          (F1, FFR-, CFR-)
    4          (F2, FFR+, CFR+)      Strong Red
    5          (F2, FFR+, CFR-)
    6          (F2, FFR-, CFR+)
    7          (F2, FFR-, CFR-)      Strong Blue
    8          (F3, FFR+, CFR+)
    9          (F3, FFR+, CFR-)
    10         (F3, FFR-, CFR+)
    11         (F3, FFR-, CFR-)      Strong Blue



these three mass values. At the end of each trial, one of three ending
conditions is true:

A The trial ends in less than 125 time steps and the blue team captures the
  flag.
B The trial ends in less than 125 time steps and the blue team is destroyed.
C The trial is stopped at the 125th time step and the blue team fails to
  complete its goal.

To capture the dynamics of the trials, we chose to define our state space in
terms of the number of units engaged and force ratios. There are three
possible engagement states at each time step: red has more blobs "free" or
unengaged (F1), both blue and red have an equal number of unengaged blobs
(F2), or blue has more unengaged blobs (F3). In each of these states, either
the red team or the blue team has more unengaged mass (FFR+ or FFR-
respectively). In each of the six possible combinations of the above states,
either red or blue has more cumulative mass (CFR+ or CFR- respectively).
Altogether there are 12 possible world states, as shown in Table 2. The table
shows states 0 and 4 to be especially advantageous for red and states 7 and 11
to be favorable to blue.
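The numbering of the twelve world states follows a simple positional scheme, which the sketch below reproduces. The function name and its boolean parameters are illustrative; only the mapping itself is read off Table 2.

```python
def state_number(engagement, red_more_free_mass, red_more_total_mass):
    """Map a world state to its number in Table 2.

    engagement          -- 1, 2 or 3 for F1, F2, F3
    red_more_free_mass  -- True for FFR+, False for FFR-
    red_more_total_mass -- True for CFR+, False for CFR-
    """
    return (4 * (engagement - 1)
            + 2 * int(not red_more_free_mass)
            + int(not red_more_total_mass))


# Enumerate all twelve states in table order.
all_states = [state_number(f, ffr, cfr)
              for f in (1, 2, 3)
              for ffr in (True, False)
              for cfr in (True, False)]
```

Each trial can then be recorded as a sequence of these integers, one per time step, which is the univariate time series used in the next section.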

In the next section, we represent this set as the states of a univariate variable 
X, and show how to model the dynamics of each trial and then cluster trials 
having similar dynamics. 

3 Clustering Markov Chains 

We describe the algorithm in general terms. Suppose we have a batch of m time
series, recording values of a variable X taking values 1, 2, ..., s. We model the
dynamics of each trial as a MC. For each time series, we estimate a transition 
matrix from data and then we cluster transition matrices with similar dynamics. 



Discovering Dynamics Using Bayesian Clustering 203 



3.1 Learning Markov Chains 

Suppose we observe a time series
$x = (x_0, x_1, x_2, \ldots, x_{i-1}, x_i, \ldots)$. The process generating
the sequence $x$ is a MC if
$p(X = x_t \mid (x_0, x_1, x_2, \ldots, x_{t-1})) = p(X = x_t \mid x_{t-1})$
for any $x_t$ in $x$ |^. Let $X_t$ be the variable representing the variable
values at time $t$; then $X_t$ is conditionally independent of
$X_0, X_1, \ldots, X_{t-2}$ given $X_{t-1}$. This conditional independence
assumption allows us to represent a MC as a vector of probabilities
$p_0 = (p_{01}, p_{02}, \ldots, p_{0s})$, denoting the distribution of $X_0$
(the initial state of the chain), and a matrix of transition probabilities

$$P = (p_{ij}) =
\begin{array}{c|cccc}
X_{t-1} \backslash X_t & 1 & 2 & \cdots & s \\ \hline
1 & p_{11} & p_{12} & \cdots & p_{1s} \\
2 & p_{21} & p_{22} & \cdots & p_{2s} \\
\vdots & & & & \\
s & p_{s1} & p_{s2} & \cdots & p_{ss}
\end{array}$$



where $p_{ij} = p(X_t = j \mid X_{t-1} = i)$. Given a time series generated
from a MC, we can estimate the probabilities $p_{ij}$ from the data and store
them in the matrix $P$. The assumption that the generating process is a MC
implies that only pairs of transitions $X_{t-1} = i \to X_t = j$ are
informative, where a transition $X_{t-1} = i \to X_t = j$ occurs when we
observe the pair $X_{t-1} = i, X_t = j$ in the time series. Hence, the time
series can be summarized into an $s \times s$ contingency table containing the
frequencies of transitions $n_{ij} = n(i \to j)$ where, for simplicity, we
denote the transition $X_{t-1} = i \to X_t = j$ by $i \to j$. The frequencies
$n_{ij}$ are used to estimate the transition probabilities $p_{ij}$
characterizing the dynamics of the process that generated the data.

However, the observed transition frequencies may not be the only source
of information about the process dynamics. We may also have some background
knowledge that can be represented in terms of a hypothetical time series of length
$\alpha + 1$ in which the $\alpha$ transitions are divided into $\alpha_{ij}$ transitions of type $i \to j$.
This background knowledge gives rise to an $s \times s$ contingency table, homologous
to the frequency table, containing these hypothetical transitions $\alpha_{ij}$ that we call
hyper-parameters.

A Bayesian estimation of the probabilities $p_{ij}$ takes into account this prior
information by augmenting the observed frequencies $n_{ij}$ by the hyper-parameters
$\alpha_{ij}$ so that the Bayesian estimate of $p_{ij}$ is

$$\hat{p}_{ij} = \frac{\alpha_{ij} + n_{ij}}{\alpha_i + n_i} \qquad (1)$$

where $\alpha_i = \sum_j \alpha_{ij}$ and $n_i = \sum_j n_{ij}$. Thus, $\alpha_i$ and $n_i$ are the numbers of
times the variable X visits state i in a process consisting of $\alpha$ and $n$ transitions,
respectively. Formally, the derivation of Equation (1) is done by assuming Bayesian
conjugate analysis with Dirichlet priors on the unknown probabilities $p_{ij}$. Further
details are in [?]. By writing Equation (1) as

$$\hat{p}_{ij} = \frac{\alpha_{ij}}{\alpha_i} \, \frac{\alpha_i}{\alpha_i + n_i} + \frac{n_{ij}}{n_i} \, \frac{n_i}{\alpha_i + n_i} \qquad (2)$$

we see that $\hat{p}_{ij}$ is an average of the classical estimate $n_{ij}/n_i$ and of the quantity
$\alpha_{ij}/\alpha_i$, with weights depending on $\alpha_i$ and $n_i$. Rewriting Equation (1) as (2) shows
that $\alpha_{ij}/\alpha_i$ is the estimate of $p_{ij}$ when the data set does not contain transitions
from the state $i$ (and hence $n_{ij} = 0$ for all $j$), and it is therefore called the
prior estimate of $p_{ij}$, while $\hat{p}_{ij}$ is called the posterior estimate. The variance of
the prior estimate $\alpha_{ij}/\alpha_i$ is given by $(\alpha_{ij}/\alpha_i)(1 - \alpha_{ij}/\alpha_i)/(\alpha_i + 1)$ and, for fixed
$\alpha_{ij}/\alpha_i$, the variance is a decreasing function of $\alpha_i$. Since small variance implies
a large precision about the estimate, $\alpha_i$ is called the local precision about the
conditional distribution $X_t \mid X_{t-1} = i$ and it indicates the level of confidence
about the prior specification. The quantity $\alpha = \sum_i \alpha_i$ is the global precision, as
it accounts for the level of precision of all the $s$ conditional distributions.

When $n_i$ is large relative to $\alpha_i$, so that the ratio $n_i/(\alpha_i + n_i)$ is approximately
1, the Bayesian estimate reduces to the classical estimate given by the
ratio between the number $n_{ij}$ of times the transition has been observed and the
number $n_i$ of times the variable has visited state $i$. In this way, the estimate of
the transition probability $p_{ij}$ is approximately 0 when $n_{ij} = 0$ and $n_i$ is large.
The variance of the posterior estimate $\hat{p}_{ij}$ is $\hat{p}_{ij}(1 - \hat{p}_{ij})/(\alpha_i + n_i + 1)$ and, for
fixed $\hat{p}_{ij}$, it is a decreasing function of $\alpha_i + n_i$, the local precision augmented by
the sample size $n_i$. Hence, the quantity $\alpha_i + n_i$ can be regarded as a measure
of the confidence in the estimates: the larger the sample size, the stronger the
confidence in the estimate.
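As an illustration, the estimate of Equation (1) can be sketched in a few lines of Python. This is a minimal sketch and not the authors' implementation: the function names are hypothetical, states are assumed to be coded 1, ..., s, and the hyper-parameters of a single chain are assumed uniform, $\alpha_{ij} = \alpha/s^2$.

```python
def transition_counts(series, s):
    """Count observed transitions i -> j in one series of states coded 1..s."""
    n = [[0] * s for _ in range(s)]
    for prev, curr in zip(series, series[1:]):
        n[prev - 1][curr - 1] += 1
    return n


def posterior_transition_matrix(n, alpha):
    """Bayesian estimate (alpha_ij + n_ij) / (alpha_i + n_i) for every row i.

    Assumption: uniform hyper-parameters alpha_ij = alpha / s**2, so that
    alpha_i = alpha / s is the local precision of each row.
    """
    s = len(n)
    a_ij = alpha / (s * s)
    a_i = alpha / s
    return [[(a_ij + n_ij) / (a_i + sum(row)) for n_ij in row] for row in n]


counts = transition_counts([1, 2, 1, 2, 2], s=2)    # [[0, 2], [1, 1]]
P = posterior_transition_matrix(counts, alpha=4.0)  # [[0.25, 0.75], [0.5, 0.5]]
```

A row with no observed transitions falls back to the prior estimate $\alpha_{ij}/\alpha_i$, while rows with many observations approach the classical estimate $n_{ij}/n_i$.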

3.2 Clustering 

The second step of the learning process is an unsupervised agglomerative clustering
of MCs on the basis of their dynamics. The available data is a set $S = \{S_i\}$
of m time series. The task of the clustering algorithm is two-fold: find the set
of clusters that gives the best partition according to some measure, and assign
each MC to one cluster. A partition is an assignment of MCs to clusters such that
each time series belongs to one and only one cluster.

We regard the task of clustering MCs as a Bayesian model selection problem.
In this framework, the model we are looking for is the most probable way
of partitioning MCs according to their similarity, given the data. We use the
probability of a partition given the data (i.e., the posterior probability of the
partition) as scoring metric and we select the model with maximum posterior
probability. Formally, this is done by regarding a partition as a hidden discrete
variable C, where each state of C represents a cluster of MCs. The number c of
states of C is unknown, but the number m of available MCs imposes an upper
bound, as $c \le m$. Each partition identifies a model $M_c$, and we denote by $p(M_c)$
its prior probability. By Bayes' Theorem, the posterior probability of $M_c$, given
the sample S, is

$$p(M_c \mid S) = \frac{p(M_c)\, p(S \mid M_c)}{p(S)}.$$

The quantity $p(S)$ is the marginal probability of the data. Since we are comparing
all the models over the same data, $p(S)$ is constant and, for the purpose of
maximizing $p(M_c \mid S)$, it is sufficient to consider $p(M_c)\, p(S \mid M_c)$. Furthermore,
if all models are a priori equally likely, the comparison can be based on the
marginal likelihood $p(S \mid M_c)$, which is a measure of how likely the data are if the
model $M_c$ is true.

The quantity $p(S \mid M_c)$ can be computed from the marginal distribution $(p_k)$
of C and the conditional distribution $(p_{kij})$ of $X_t \mid X_{t-1} = i, C_k$ (where $C_k$
represents the cluster membership of the transition matrix of $X_t \mid X_{t-1}$) using
a well-known Bayesian method with conjugate Dirichlet priors [?]. Let $n_{kij}$ be
the observed frequencies of transitions $i \to j$ in cluster $C_k$, and let $n_{ki} = \sum_j n_{kij}$
be the number of transitions observed from state $i$ in cluster $C_k$. We define $m_k$
to be the number of time series that are merged into cluster $C_k$. The observed
frequencies $(n_{kij})$ and $(m_k)$ are the data required to learn the probabilities $(p_{kij})$
and $(p_k)$ respectively and, together with the prior hyper-parameters $\alpha_{kij}$, they
are all that is needed to compute the probability $p(S \mid M_c)$, which is the product
of two components: $f(S, C)$ and $f(S, X_{t-1}, X_t, C)$. Intuitively, the first quantity
is the likelihood of the data, if we assume that we can partition the m MCs into
c clusters, and it is computed as



$$f(S, C) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + m)} \prod_{k=1}^{c} \frac{\Gamma(\alpha_k + m_k)}{\Gamma(\alpha_k)}.$$



The second quantity measures the likelihood of the data when, conditional on 
having c clusters, we uniquely assign each time series to a particular cluster. 
This quantity is given by 






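$$f(S, X_{t-1}, X_t, C) = \prod_{k=1}^{c} \prod_{i=1}^{s} \frac{\Gamma(\alpha_{ki})}{\Gamma(\alpha_{ki} + n_{ki})} \prod_{j=1}^{s} \frac{\Gamma(\alpha_{kij} + n_{kij})}{\Gamma(\alpha_{kij})}$$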



where $\Gamma(\cdot)$ denotes the Gamma function. Once created, the transition probability
matrix of a cluster $C_k$ (obtained by merging time series) can be estimated
as $\hat{p}_{kij} = (\alpha_{kij} + n_{kij})/(\alpha_{ki} + n_{ki})$.
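The marginal likelihood is a product of many Gamma-function ratios, so in practice it is convenient to work with its logarithm. The following Python sketch computes $\log f(S, C) + \log f(S, X_{t-1}, X_t, C)$ using `math.lgamma`. It is an illustration under stated assumptions, not the authors' code: the function name and data layout are hypothetical, and it assumes $\alpha = \sum_k \alpha_k$ and $\alpha_{ki} = \sum_j \alpha_{kij}$.

```python
from math import lgamma


def log_marginal_likelihood(clusters, alpha_k, alpha_kij):
    """Log of p(S | M_c) for one partition.

    clusters  : list of (m_k, n_k), with n_k the s x s table of counts n_kij
    alpha_k   : list of hyper-parameters, one per cluster
    alpha_kij : list of s x s hyper-parameter tables, one per cluster
    Assumptions: alpha = sum(alpha_k) and alpha_ki = sum_j alpha_kij.
    """
    alpha = sum(alpha_k)
    m = sum(m_k for m_k, _ in clusters)

    # log f(S, C): how the m series are distributed over the c clusters
    log_f1 = lgamma(alpha) - lgamma(alpha + m)
    for (m_k, _), a_k in zip(clusters, alpha_k):
        log_f1 += lgamma(a_k + m_k) - lgamma(a_k)

    # log f(S, X_{t-1}, X_t, C): transitions observed within each cluster
    log_f2 = 0.0
    for (_, n_k), a_k_table in zip(clusters, alpha_kij):
        for n_row, a_row in zip(n_k, a_k_table):
            log_f2 += lgamma(sum(a_row)) - lgamma(sum(a_row) + sum(n_row))
            for n, a in zip(n_row, a_row):
                log_f2 += lgamma(a + n) - lgamma(a)
    return log_f1 + log_f2


# One cluster holding one series with a single observed transition 1 -> 1
clusters = [(1, [[1, 0], [0, 0]])]
score = log_marginal_likelihood(clusters, [1.0], [[[1.0, 1.0], [1.0, 1.0]]])
```

In this toy case the result equals log(1/2), the probability of one transition under a uniform prior on a two-state row.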

In principle, we just need a search procedure over the set of possible partitions
and the posterior probability of each partition as a scoring metric. However,
the number of possible partitions grows exponentially with the number
of MCs to be considered and, therefore, a heuristic method is required to make
the search feasible. The solution we propose is to use a measure of similarity
between estimated transition probability matrices to guide the search. Let $P_1$
and $P_2$ be transition probability matrices of two MCs. We adopt, as measure
of similarity, the average Kullback-Leibler distance between the rows of the two
matrices. Let $p_{1ij}$ and $p_{2ij}$ be the probabilities of the transition $i \to j$ in $P_1$
and $P_2$. The Kullback-Leibler distance of these two probability distributions is
$D(p_{1i}, p_{2i}) = \sum_{j=1}^{s} p_{1ij} \log (p_{1ij}/p_{2ij})$ and the average distance between $P_1$ and
$P_2$ is then $D(P_1, P_2) = \sum_i D(p_{1i}, p_{2i})/s$.

Our algorithm performs a bottom-up search by recursively merging the closest
MCs (representing either a cluster or a single trial) and evaluating whether
the resulting model is more probable than the model where these MCs are separated.
When this is the case, the procedure replaces the two MCs with the cluster
resulting from their merging and tries to cluster the next nearest MCs. Otherwise,
the algorithm tries to merge the second best, the third best, and so on, until the
set of pairs is empty and, in this case, returns the most probable partition found
so far. The rationale behind this ordering is that merging closer MCs first should
result in better models and increase the posterior probability sooner. Note that
the agglomerative nature of the clustering procedure spares us the further effort
of assigning each single time series to a cluster, because this assignment comes
as a side effect of the clustering process.
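The search just described can be sketched as follows. This is a minimal Python sketch under stated assumptions rather than the authors' implementation: `avg_kl`, `cluster_by_dynamics`, `estimate` and `score` are hypothetical names, the scoring function (for example a log marginal likelihood) and the estimator are supplied by the caller, and the distance is evaluated in one direction only, since the Kullback-Leibler distance is not symmetric. All estimated probabilities are assumed to be strictly positive, which holds for Bayesian estimates with positive hyper-parameters.

```python
from itertools import combinations
from math import log


def avg_kl(P1, P2):
    """Average Kullback-Leibler distance between corresponding rows of P1 and P2."""
    total = sum(p * log(p / q)
                for row1, row2 in zip(P1, P2)
                for p, q in zip(row1, row2) if p > 0)
    return total / len(P1)


def cluster_by_dynamics(counts, estimate, score):
    """Greedy bottom-up search over partitions.

    counts   : list of s x s transition count tables, one per time series
    estimate : maps a count table to an estimated transition matrix
    score    : maps a list of (m_k, n_k) clusters to a log posterior score
    """
    clusters = [(1, n) for n in counts]
    best = score(clusters)
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        mats = [estimate(n) for _, n in clusters]
        # Candidate pairs, closest dynamics first
        pairs = sorted(combinations(range(len(clusters)), 2),
                       key=lambda ab: avg_kl(mats[ab[0]], mats[ab[1]]))
        for a, b in pairs:
            (ma, na), (mb, nb) = clusters[a], clusters[b]
            union = (ma + mb,
                     [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(na, nb)])
            trial = [c for k, c in enumerate(clusters) if k not in (a, b)] + [union]
            trial_score = score(trial)
            if trial_score > best:  # keep the merge only if the model is more probable
                clusters, best, merged = trial, trial_score, True
                break
    return clusters
```

When no candidate pair improves the score, the loop stops and the current partition is returned, which mirrors the stopping rule in the text.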

We conclude this section by suggesting a choice of the hyper-parameters
$\alpha_{kij}$. We use uniform prior distributions for all the transition probability
matrices considered at the beginning of the search process. The initial $m \times s \times s$
hyper-parameters $\alpha_{kij}$ are set equal to $\alpha/(ms^2)$ and, when two MCs are similar
and the corresponding observed frequencies of transitions are merged, their
hyper-parameters are summed up. Thus, the hyper-parameters of a cluster
corresponding to the merging of $m_k$ initial MCs will be $m_k \alpha/(ms^2)$. In this way,
the specification of the prior hyper-parameters requires only the prior global
precision $\alpha$, which measures the confidence in the prior model. An analogous
procedure can be applied to the hyper-parameters $\alpha_k$ associated with the prior
estimates of $p_k$. We note that, since $\Gamma(x)$ is defined only for values greater than
zero, the hyper-parameters $\alpha_{kij}$ must be strictly positive.
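A short Python sketch of this choice of hyper-parameters is given below. The function names are hypothetical. The usage lines take the values reported in Section 4 (a global precision of 972 and 81 series) and assume s = 12 states, which is the value implied by the reported initial assignment of 1/12.

```python
def initial_hyperparameters(alpha, m, s):
    """Uniform prior: each of the m * s * s hyper-parameters equals alpha / (m * s**2)."""
    value = alpha / (m * s * s)
    return [[[value] * s for _ in range(s)] for _ in range(m)]


def merge_hyperparameters(a1, a2):
    """Hyper-parameters of a merged cluster: element-wise sum of the two tables."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]


hp = initial_hyperparameters(alpha=972, m=81, s=12)  # every entry is 1/12
pair = merge_hyperparameters(hp[0], hp[1])           # every entry is 2/12
```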



4 Clusters of Dynamics 

The 81 time series generated with AFS for the Capture the Flag scenario consist
of 42 trials in which the blue team captures the red flag (end state A), 17 trials
in which the blue forces are defeated (end state B) and 22 which were stopped
after 125 time steps (end state C).

We used our clustering algorithm to partition the time series according
to the dynamics they represent. A choice of a prior global precision $\alpha = 972$
(corresponding to the initial assignment $\alpha_{kij} = 1/12$ in the 81 transition
probability matrices) yields 8 clusters. Table 3 gives the assignment of time
series to each of the 8 clusters. By analyzing the dynamics represented by each
cluster, it is possible to reconstruct the course of events for each trial. We did
this “by hand” to understand and evaluate the clusters, to see whether the
algorithm divides the trials in a significant way. We found that, indeed, the
clusters correspond not only to end states, but also to different prototypical ways in
which the end states were reached.

Table 3. Summary of the clusters identified by the algorithm.

Cluster    A    B    C   Total
C1         5    1    3     9
C2         2    0    2     4
C3         7    0    0     7
C4        14    0   12    26
C5         1    0    1     2
C6         8   16    4    28
C7         2    0    0     2
C8         3    0    0     3
Total     42   17   22    81

Clusters C2, C4 and C5 consist entirely of trials in which blue captured the
flag or time expired (end states A and C). While this may at first be seen as
the algorithm’s inability to distinguish between the two events, a large majority
(though it is not possible to judge how many) of the “time-outs” were caused
by the blue team’s inability to capitalize on a favorable circumstance. A good
example is a situation in which the red team is eliminated, but the blue blobs
overlap in their attempt to reach the flag. This causes them to slow to a speed at
which they are unable to reach the flag before time expires. Only a handful
of “time-outs” represent an encounter in which the red team held the blue team
away from the flag. Clusters C2, C4 and C5 demonstrate that the clustering
algorithm can identify subtleties in the dynamics of trials, as no information
about the end state is provided, implicitly or explicitly, by the world state.

Clusters C1 and C6 merge trials of all types. C1 is an interesting cluster of
drawn-out encounters in which the advantage changes sides, and blobs engage
and disengage much more than in the other clusters. For example, C1 is the only
cluster in which the MC visits all states of the variable and, in particular, is the
only cluster in which state 8 is visited. By looking at the transition probabilities,
we see that state 8 is more likely to be reached from state 6, and to be followed by
state 0. Thus, from a condition of equal free units (F2) we move to a situation
in which blue disengages a unit and has a free unit advantage (F3), which is
immediately followed by a situation in which red has a free unit advantage (F1).
The “time-outs” (end state C) in this cluster represent the red team holding off
the blue team until time runs out.

Cluster C6, on the other hand, contains all but one of the trials in which the
red team eliminated all of the blue units (end state B), as well as very similar
trials where the red blobs appear dominant, but the blue team makes a quick
move and grabs the flag. The cluster is characterized by having transitions among
states 0, 4 and 10, with a large probability of staying in state 0 (in which the
red forces are dominant) when reached. The large number of trials in which the
blue team wins (especially large when we realize that C-endings are blue wins
but for the fact that overlapping forces move very slowly) is a result of Blue 1
being tasked to escort Blue 2, a tactic which allows Blue 1 to adapt its actions
to a changing environment more readily than other units’ tactics and, in many
trials, gives blue a tactical advantage.

Fig. 1. Markov Chains representing clusters C3, C7 and C8.



Clusters C3, C7 and C8 merge only time series of end state A, in which the
blue team always captured the flag. Figure 1 displays the MCs representing the
three clusters (in which we have removed transitions with very low probability).
Each cluster captures a different dynamics of how a blue victory was reached.
For example, cluster C3 is characterized by transitions among states 1, 5, 7 and
11, in which the blue team maintains dominance, and transitions to states 4 and
8, in which the red forces are dominant, are given a very low probability.
Indeed, the number of time steps of the trials assigned to cluster C3 was always
low, as the blue team maintained dominance throughout the trials and states 4
and 8 were never visited.

The trials in cluster C7 visited states 0, 4, and 10 frequently and correspond
to cases in which the blue team won despite a large mass deficit. In these cases,
the objective was achieved by a break away of one of the blue blobs that outruns
the red blobs to capture the flag. Cluster C8 displays transitions among states
0, 1, 4, 5, 6, 10 and 11 and represents longer, more balanced encounters in which
the blue team was able to succeed.

5 Conclusions 

Our overriding goal is to develop a program that automatically generates
explanations of time series data, and this paper takes a step toward this goal by
introducing a new method for clustering by dynamics. This method starts by
modeling the dynamics as MCs and then applies a Bayesian clustering procedure
to merge these MCs into a smaller set of prototypical dynamics. Explaining half a
dozen clusters is much easier than explaining hundreds of time series. Although
the explanations offered in this paper are still generated by human analysts (we
have not yet achieved fully-automated explanation), the explanatory task
is made much easier by our method.



Abstract. The capability of making use of existing prior knowledge is
an important challenge for Knowledge Discovery tasks. As an unsupervised
learning task, clustering appears to be one of the tasks that might
benefit most from prior knowledge. In this paper, we propose a
method for providing declarative prior knowledge to a hierarchical
clustering system, stressing the interactive component. Preliminary results
suggest that declarative knowledge is a powerful bias for improving
the quality of clustering in domains where the internal biases of the
system are inappropriate or there is not enough evidence in the data, and
that it can lead the system to build more comprehensible clusterings.



1 Introduction 

Clustering is a data mining task aiming to discover useful patterns in the data
without any external advice. As opposed to classification tasks, where the goal
is to build descriptions from labeled data, clustering systems must determine
for themselves the way of dividing the objects. Several clustering methods have
originated from statistics and pattern recognition, providing a wide range of
choices. As pointed out by early machine learning work, a problem with most
clustering methods is that they do not make the interpretation task easier for users.
A new proposal, referred to as conceptual clustering [5], was made in order to
solve this problem. Importantly, the original formulation of conceptual clustering
stated that learning should exploit any existing background knowledge. However,
there is a lack of approaches concerned with this issue. From another point of
view, Knowledge Discovery in Databases (KDD) has emerged as a new discipline
combining methods from statistics, pattern recognition, machine learning and
databases. The KDD process is described as an iterative task involving several
steps with an important participation of users, especially in providing background
knowledge. However, clustering is often deemed a knowledge-weak task and, in
general, neither statistical nor machine learning methods make any assumption
about the existence of prior knowledge.

Despite this lack of attention, clustering is a task that may obtain great
benefits from using prior knowledge. The results of any inductive learning task
are strongly dependent on the data. This is particularly true in the case of
clustering, since there is no target variable to guide the process as in supervised
tasks. Therefore, noisy, incomplete or incorrect data may pose hard problems
for building clusterings. For that reason, it appears desirable to build clustering
systems with the capability of using external knowledge in the inductive process.

In this paper we present a method to incorporate prior knowledge into the 
process of constructing hierarchical clusterings. First we discuss the role of prior 
knowledge in unsupervised domains and point out some key issues. Next, we 
briefly review the ISAAC conceptual clustering system, which is the system em- 
ployed in the experiments. Then, we propose a way of incorporating prior knowl- 
edge into the process and examine an example of using the method. An empirical 
study is performed using several data sets from the UCI Repository and some 
conclusions are presented. 

2 Guiding Clustering with Prior Knowledge 

We view the role of background knowledge in the clustering process under three
different dimensions, namely, search, comprehensibility and validation. The
influence of prior knowledge is easier to understand if we consider clustering as
a search in a hypothesis space [?]. For any given data set, a potentially infinite
number of hypotheses may be formulated and the problem of exploring all of
them becomes intractable. To address this problem, clustering methods (and, in
general, any inductive method) have to determine which hypotheses are better
and discard the rest. The factors that influence the definition and selection of
inductive hypotheses are called a bias [2]. It is easy to see that every inductive
learning algorithm must include some form of bias, that is, it will always prefer
some hypothesis over another. Since clustering is by nature a data-driven process,
we can only expect to discover concepts that are clearly reflected in data. Of
course, if the system uses the correct bias, it could discover any existing concept.
However, in inductive systems, internal biases only take into account information
provided by observed data, so they need some sort of external advice to
change their behavior. In this sense, we can see the use of prior knowledge as an
external bias that constrains the hypotheses generated and directs the system
towards the desired output. Therefore, prior knowledge overrides undesirable
existing knowledge gathered from incomplete or incorrect data.

The second dimension related to using prior knowledge is comprehensibility.
Comprehensibility has typically been addressed by incorporating a bias towards
simplicity. Simplicity is often measured as the number of features used to
describe the resulting concepts and hence, given two equally interesting hypotheses,
systems would prefer the one which has the shorter description. However,
this bias does not completely guarantee results to be comprehensible to users;
rather, it produces readable results. Readability is a desired property for results
in order to be comprehensible, but it is not the only one. A system may present
very readable results but ignore some important concept or relationship from the
user's point of view, so that the results are not fully interpretable. In this sense,
comprehensibility may be viewed as strongly related to the user’s goals and
intuitions about the domain. From this point of view, prior knowledge reflects
specific user goals or reasoning paths about the concepts to be discovered and
may contribute to obtaining more interpretable results.

Let P = {C1, C2, ..., Ck} be the initial partition
Let NG be the level of generality desired

Function Isaac(P, NG)
  while Gen(P, NG) < 0 do
    Let C be the least general cluster in P
    Compute the similarity between C and the rest of clusters in P
    Merge C with the most similar concept in P
  endwhile

Table 1. The ISAAC algorithm.

Finally, we can view prior knowledge as related to the validation of knowledge.
The system can provide the user with the chance to alter its biases. With the
selected bias, induction is performed and the results passed back to the user,
who evaluates these results and decides whether they are satisfactory or not.
In the latter case, the user can choose a different bias. Prior knowledge acts
just as another bias which can be provided by the user besides modifying the
internal system biases. Validating the results of using some piece of knowledge
implies validating the correctness of this knowledge. Therefore, allowing users
to express partial knowledge about the domain does not only serve as a guide
for the clustering process, but also provides a means of validating users' theories.

In sum, we view the role of prior knowledge in clustering as highly related
to user interaction. In unsupervised settings it is likely that it will be hard to
obtain certain and complete knowledge from users. Providing clustering systems
with the ability to use partial knowledge becomes an important concern and
should help to confirm or reject this knowledge and to obtain additional knowledge.

3 A Brief Introduction to ISAAC 

In this section we give a brief explanation of the ISAAC system [?], which will be
used to exemplify the use of prior knowledge in clustering. Isaac is a conceptual
clustering system that builds probabilistic concept hierarchies. A probabilistic
description gives the feature-value distributions of the objects in a cluster. For
each cluster $C_k$, the system stores the conditional probabilities $P(A_i = V_{ij} \mid C_k)$
for each feature $A_i$ and each value $V_{ij}$. Isaac works with nominal values,
estimating probabilities from the frequencies over the observed data.

Isaac proceeds by using a typical agglomerative algorithm as shown in Table 1,
in which clusters (or objects) are repeatedly merged. However, Isaac
differs from statistical agglomerative approaches in several ways. First, it is
intended to allow users to guide the construction of the cluster hierarchy which
better suits their needs. The user can use the NG parameter, which is in the [0,1]
range, to specify both the number of levels and their generality in the hierarchy.
As the NG value increases, the system creates more general partitions with few
concepts. Lower NG values instruct the system to build more specific partitions.
The user can interact with the system by experimenting with different sets of values
for this parameter. Since the effect of modifying the NG values is semantically
clear to the user, it is easier to deal with this parameter than, for instance, to
specify distance thresholds to decide cut points in a tree. The algorithm shown in
Table 1 presents only the procedure applied to one NG value; an outer
loop iterates over the different parameter values specified by the user. The
initial partition P will be the set of singleton clusters for the first iteration, and
for each subsequent iteration it will be the previously obtained set of clusters.
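The control flow of Table 1, together with the outer loop over NG values, can be sketched in Python as follows. This is only an illustration of the loop structure: the text does not define the generality and similarity measures here, so they are passed in as hypothetical callables (`generality`, `similarity`, `level_reached`), clusters are represented as tuples of objects, and `level_reached` stands in for the test on Gen(P, NG).

```python
def isaac_level(partition, ng, generality, similarity, level_reached):
    """Inner loop of Table 1: merge until the level of generality ng is reached."""
    partition = list(partition)
    while len(partition) > 1 and not level_reached(partition, ng):
        # Least general cluster in the current partition
        i = min(range(len(partition)), key=lambda k: generality(partition[k]))
        # Most similar other cluster
        j = max((k for k in range(len(partition)) if k != i),
                key=lambda k: similarity(partition[i], partition[k]))
        merged = partition[i] + partition[j]
        partition = [c for k, c in enumerate(partition) if k not in (i, j)] + [merged]
    return partition


def isaac(objects, ng_values, generality, similarity, level_reached):
    """Outer loop: one hierarchy level per NG value, from specific to general."""
    partition = [(obj,) for obj in objects]  # singleton clusters
    hierarchy = []
    for ng in sorted(ng_values):
        partition = isaac_level(partition, ng, generality, similarity, level_reached)
        hierarchy.append((ng, partition))
    return hierarchy
```

Because only the partition reached at each NG value is stored, the resulting hierarchy has as many levels as NG values and is not forced to be a binary tree.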

In addition, the system does not necessarily construct binary trees, as many
hierarchical clustering algorithms do. Rather, the shape of the hierarchy depends
on the set of NG values provided by the user. This occurs because the system does
not store all the intermediate mergings performed when computing a new level for
the hierarchy. In order to decide when a level of generality given by the user has been
reached, the system employs a probabilistic generality measure linked to the
NG parameter. As shown in Table 1, the algorithm also takes advantage of the
generality measure to choose the candidates to merge, trying to produce balanced
levels formed by concepts of approximately the same level of abstraction.
Although the system incorporates other capabilities such as a feature selection
mechanism, they are not used in this work (see [Z] for more details). Note that,
although it incorporates some particular features, the system is basically a
hierarchical agglomerative clusterer and, therefore, the rest of the discussion and
experiments might be generalized to similar algorithms as well.



4 Using Declarative Knowledge with ISAAC 

Now we describe a method that enables ISAAC to exploit prior knowledge. Our
focus is not only on finding a suitable bias for the system, but also on providing
some feedback to the user about the relationship between their knowledge and
that contained in the data. Users should experiment with different knowledge
in order to see the effect on the final results. This should provide users with a means
to verify some uncertain hypotheses in the light of how they fit with the data.

Fig. 1. Using declarative knowledge to constrain the clustering process

We allow the user to define a set of classification rules expressed in a subset of
first-order logic (FOL). Rules always involve one universally quantified
variable and an implication of the form

$$\forall x : P(x) \Rightarrow Q(x)$$

where $P(x)$ is a conjunction of predicates of the form $p(x, v)$ indicating that
$x$ takes the value $v$ for the property $p$, and $Q(x)$ is a predicate indicating that
$x$ belongs to class $q$ (e.g. $\forall x : \mathit{legs}(x, 4) \wedge \mathit{milk}(x, \mathit{yes}) \Rightarrow \mathit{mammal}(x)$). In
practice, we can omit the universal quantifier and express rules as a conjunctive
combination of feature-value pairs:

$$(A_1 = V_{1j}) \wedge (A_2 = V_{2j}) \wedge \ldots \wedge (A_n = V_{nj}) \Rightarrow C_k$$

where $A_i$ is one of the features present in the dataset and $V_{ij}$ is one of the
legal values for feature $i$. $C_k$ is a dummy label for the set of objects that satisfy
the rule and that may have some sense for the user (e.g. $\mathit{legs} = 4 \wedge \mathit{milk} = \mathit{yes} \Rightarrow \mathit{mammal}$).
If the user defines several rules with the same label, they are
considered as a disjunction and unified into a single rule. As presently defined,
all the predicates must reference some feature present in the data set.
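One simple way to represent such rules in code is as a dictionary of feature-value pairs plus a label. The following Python sketch is a hypothetical representation, not the one used in ISAAC: the function names and the dictionary encoding of objects are assumptions made for illustration.

```python
def covers(conditions, obj):
    """True if obj satisfies every feature = value pair of one conjunction."""
    return all(obj.get(feature) == value for feature, value in conditions.items())


def unify_rules(rules):
    """Group rules by label: rules sharing a label form a disjunction of conjunctions."""
    unified = {}
    for conditions, label in rules:
        unified.setdefault(label, []).append(conditions)
    return unified


def label_of(obj, unified):
    """Label of the first unified rule covering obj, or None if no rule applies."""
    for label, disjuncts in unified.items():
        if any(covers(conditions, obj) for conditions in disjuncts):
            return label
    return None


rules = [({"legs": 4, "milk": "yes"}, "mammal")]
unified = unify_rules(rules)
cow = {"legs": 4, "milk": "yes", "feathers": "no"}
hen = {"legs": 2, "milk": "no", "feathers": "yes"}
```

Here `label_of(cow, unified)` yields the label of the rule, while `hen` is covered by no rule.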

Under the Isaac framework it is relatively easy to incorporate the bias provided
by the set of rules into the clustering process. Specifically, we implemented
a sort of meta-level that constrains the mergings that can be done during the
agglomerative process. The goal is to obtain a cluster hierarchy consistent with
the knowledge provided by the rules.

Luis Talavera and Javier Bejar

During the clustering process, there are clusters that partially represent some
rule in the sense that by merging all these clusters together, we would have all
the objects that the rule covers. The meta-level allows neither merging
any pair of clusters containing objects covered by different rules, nor merging a
cluster whose objects are not covered by any rule with a cluster containing objects
covered by some rule. For example, clusters C1 and C2 in Figure 1 cannot be merged with
clusters C3 to C7, because this merging would prevent the eventual formation of cluster C8,
which covers all the objects described by Rule 1. These constraints guarantee that,
at a certain level of generality, a cluster that completely satisfies each rule will
be formed. Once a cluster includes all the instances covered by some rule, the
meta-level removes all the constraints associated with this cluster. For example,
cluster C3 covers all the objects of Rule 2 and, therefore, it can be merged with
clusters C4 to C7. However, it cannot be merged with clusters C1 and C2 because
of the constraints associated with these latter clusters.
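The merge constraints can be expressed as a small predicate. The Python sketch below is one possible reading of the description, not the ISAAC meta-level itself: each cluster is assumed to be a list holding, for every object, the label of the rule that covers it (or `None`), `rule_sizes` is assumed to give the total number of objects each rule covers, and both function names are hypothetical.

```python
def pending_rules(cluster_labels, rule_sizes):
    """Rules the cluster represents only partially (some, but not all, covered objects)."""
    return {label for label in set(cluster_labels) - {None}
            if cluster_labels.count(label) < rule_sizes[label]}


def can_merge(a, b, rule_sizes):
    """Allow a merge only if both clusters are bound by the same pending rules.

    Clusters with no pending rule (uncovered objects, or rules already fully
    contained in the cluster) are unconstrained and may merge with each other.
    """
    return pending_rules(a, rule_sizes) == pending_rules(b, rule_sizes)


rule_sizes = {"rule1": 3, "rule2": 1}
c1, c2 = ["rule1", "rule1"], ["rule1"]   # partial representations of Rule 1
c3 = ["rule2"]                           # already covers all of Rule 2
c4 = [None, None]                        # objects covered by no rule
```

With these toy clusters, `c1` and `c2` may be merged with each other but not with `c3` or `c4`, whereas `c3` may be merged with `c4`, mirroring the example in the text.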

As a result, the user is provided with information about the level of generality
at which the rule fits in the obtained hierarchy. Also, statistics about the number
of mergings which have not been allowed are shown in order to give a rough
approximation of the influence of the declarative knowledge on the process. The
relative position in the hierarchy of the objects covered by a rule is highly
useful information because it indicates the level of abstraction of the rule in
the context of a given dataset. The user may exploit this information in two
different ways. First, the user gains new knowledge for deciding which levels the
hierarchy must include with regard to the hypotheses provided to the system.
Second, the user can modify the rules by generalizing or specializing the conditions
included according to the results provided.

5 An Example of Interactive Clustering

In this section we present an example of application of the presented approach,
simulating interaction with a user. We used a subset of 1000 objects from the
mushroom data set obtained from the UCI Repository, in which missing values
have been substituted by the most frequent value. In these experiments the
aim is to rediscover the original two-class division into edible and poisonous
mushrooms. In a real world problem, a user would be present in order to provide
background knowledge and evaluate the results. In this example, in the absence
of a real user, we simulate the interaction using the number of correctly classified
objects to evaluate the quality of the clusterings. Therefore, we assume that a
user would prefer the more accurate clustering. Note that labels are used only in the
external evaluation and are never used during clustering.
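The text does not spell out how correctly classified objects are counted. A common convention, assumed here purely for illustration, is to assign each top-level cluster its majority class and count the objects that agree with it. The function name below is hypothetical.

```python
from collections import Counter


def clustering_accuracy(cluster_ids, labels):
    """Fraction of objects whose label matches the majority label of their cluster.

    Assumption: each cluster is identified with its most frequent class; the
    labels are used only for this external evaluation.
    """
    by_cluster = {}
    for cid, label in zip(cluster_ids, labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return correct / len(labels)


# Toy example: two top-level clusters over five objects (e = edible, p = poisonous)
accuracy = clustering_accuracy([0, 0, 0, 1, 1], ["e", "e", "p", "p", "p"])  # 0.8
```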

Initially, assuming that no information is available, we constructed tentative
clusterings by providing a set of NG values with an increment of 0.05 until
obtaining two clusters at the top level. Recall that ISAAC is an agglomerative
algorithm, so that it does not provide a single optimal partition, and that it needs
a set of NG values in order to decide which are going to be the levels in the
hierarchy. The best (i.e., the most accurate at the top level) of these clusterings
was used in the rest of these experiments. Additionally, we ran the C4.5 supervised
system [?] to induce a decision tree over the same data. From the decision tree,
we extracted some rules that were used to simulate the background knowledge
provided by the user.

We observe that constructing a two-class top level partition provides a 
reasonably good description of the domain (89.50% accuracy) but it is not 
completely satisfying. So we incorporate a rule that we believe identifies a 
class of mushrooms, which we will denote p: 

IF odor=f THEN p 

The hierarchy reinforces this intuition since most objects covered by this 
rule have already been clustered together. Using this rule to bias clustering, 
we get a new partitioning at the top level that appears to be somewhat better 
(92.90%). Additional knowledge suggests that another rule should hold for this 
data. Examining the hierarchy we note that an inner node contains a group that 
seems to correspond to our assumptions. However, the node is not clustered with 
our p class, as we would like. So we add the rule 

IF odor=c THEN p 

The resulting top level now appears to be a very good partition for this data 
set (95.60%). Note that, individually, these two rules do not provide completely 
new ways of grouping objects with respect to the original decisions of the 
algorithm. However, considered together as describing a larger class (because 
they both point to the same dummy label), they change the bias of the algorithm 
by forcing it to group two subgroups that originally would not have been merged. 
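
One way to picture how rules map objects to dummy labels is the following 
minimal sketch. It assumes a hypothetical rule format (a dictionary of 
attribute conditions plus a label) and a hypothetical helper 
`assign_dummy_labels`; neither is taken from ISAAC itself. Rules that share a 
label, such as the two odor rules above, mark their objects as members of one 
larger group. 

```python
# Illustrative only: the rule format and function names are assumptions,
# not part of the ISAAC system described in the text.

def matches(obj, conditions):
    """True if the object satisfies every attribute=value condition."""
    return all(obj.get(attr) == value for attr, value in conditions.items())


def assign_dummy_labels(objects, rules):
    """Return one dummy label per object (None if no rule covers it)."""
    labels = []
    for obj in objects:
        label = None
        for conditions, dummy in rules:
            if matches(obj, conditions):
                label = dummy
                break
        labels.append(label)
    return labels


# Two rules pointing to the same dummy label behave as one disjunctive rule.
rules = [({"odor": "f"}, "p"), ({"odor": "c"}, "p")]
objects = [{"odor": "f"}, {"odor": "c"}, {"odor": "n"}, {"odor": "f"}]

labels = assign_dummy_labels(objects, rules)
coverage = sum(label is not None for label in labels)
```

Counting the covered objects in this way gives the kind of coverage figure 
that is reported for each rule in the experiments. 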

Despite these improvements, we still have an intuition about a type of p 
mushrooms defined by the rule: 

IF odor=n AND spore-print-color=r THEN p 

Again, we incorporate the rule into our process. Curiously, we now obtain a 
worse top level partition than before, even worse than the one obtained without 
using prior knowledge (70.20%). From the output of the system, we note that 
the last rule only covers 9 out of the 1000 objects, so we suspect that, perhaps, 
there is not enough evidence for supporting the addition of these few objects to 
our p class. A further exploration into the results reveals that the rule holds at 
the level obtained with NG = 0.40, a level partitioning the data into 7 clusters. 
By examining these clusters we observe that, effectively, a cluster representing 
the rule has been created, but the overall partition is not very good at this level 
either (83.70%). Is the rule wrong? Not necessarily, since we have observed part 
of these objects grouped together before applying the rule. Figure 2 depicts the 
distribution of p and e mushrooms in part of the level formed with NG = 0.40. 
Figure 2(a) represents the clusters formed by the original algorithm with no 
external guidance, while Figure 2(b) shows the clusters obtained with the three 
rules. Note how the plain algorithm clusters 103 objects from the p class mixed 
with a number of objects of the e class. The improvement with the two first 
rules stems from forcing part of these objects to be clustered together with 
other objects from the p class forming a group of 296 objects. When adding 
the last rule, the group of objects induced increases only up to 305 objects. 
However, as shown in Figure 2(b), forcing these few additional objects to be 
clustered together with our big p class might be introducing a misleading bias. 
The problem is that the formed cluster is heterogeneous from the viewpoint of 
the data, because the last 9 objects were more similar to other objects in the e 
class than to these objects in the p class. By constraining a number of mergings 
suggested by the data, the algorithm comes up with an odd result. The lesson 
we learn from that is that, although we could be convinced that the 9 objects 
covered by the last rule effectively are part of the p class, the central tendency 
of the objects in the class is somewhat different, that is, these 9 objects are 
what could be deemed an exception. In a probabilistic concept hierarchy, the 
way to represent this sort of exception is by allowing the special objects to be 
clustered with the wrong class at the top levels and creating internal disjuncts 
represented by inner nodes in the hierarchy. 

Fig. 2. Distributions of object labels in different mushroom hierarchies obtained 
with different levels of prior knowledge. 

We can obtain this behavior by using the same rule and changing the dummy 
label, so that the 9 objects are clustered together but are not forced to be 
part of the big p class. The rule works as expected, as shown in Figure 2(c). Now 
we have the 9 objects clustered together in an inner level of the hierarchy as a 
child of the non-p class, and the quality of the top level partition is maintained 
(95.60%). Finally, we think that mushrooms having odor=p should also be 
clustered in the p class. By examining the last hierarchy, we note that the system 
has already clustered together the objects satisfying this condition in an inner 
node at the fifth level. As in the previous case, it is unlikely that the system 
could cluster that small number of objects (32) correctly at the top level, since 
they appear to be very similar to the non-p class mushrooms; this internal 
disjunction is therefore the best solution. 



6 Experiments 

The following experiments compare the performance of the ISAAC clustering 
algorithm with and without using declarative knowledge. The evaluation is done 
on 3 data sets from the UCI Repository. Although agglomerative approaches 
are not as dependent on the ordering of presentation of instances as incremental 
systems, they depend on local merging decisions made during the clustering 
process. There might be several clusters in the initial stages of the process scoring 
the same similarities and the decision of which pair to merge first may result in 








Table 2. Declarative knowledge used in the experiments 

Data set        Rules                                             Objects 
--------------  ------------------------------------------------  ------------ 
voting records  RULE 1: IF physician-fee-freeze=n AND             37 (8.51%) 
                  adoption-of-the-budget-resolution=y AND 
                  superfund-right-to-sue=y AND 
                  synfuels-corporation-cutback=y THEN democrat 
                RULE 2: IF physician-fee-freeze=n AND             25 (5.75%) 
                  adoption-of-the-budget-resolution=n 
                  THEN democrat 
                RULE 3: IF physician-fee-freeze=n THEN democrat   258 (59.31%) 
mushroom        RULE 1: IF odor=c THEN poisonous                  27 (2.70%) 
                RULE 2: IF odor=f THEN poisonous                  269 (26.90%) 
promoters       RULE 1: IF p-35=g AND p-12=t THEN -               4 (3.77%) 
                RULE 2: IF p-35=t AND p-12=t AND p-10=t THEN +    5 (4.71%) 
                RULE 3: IF p-35=t AND p-12=a THEN +               19 (17.93%) 

different final partitions. Therefore, all the results presented correspond to the 
average of 30 different runs of the system in order to account for this variability. 

We used C4.5 to induce a decision tree for each data set. From this tree we 
extracted several classifying rules that were provided as background knowledge 
to ISAAC in additional runs. Table 2 shows the rules used for each data set 
and the total number of objects that each rule covers. Table 3 shows the results 
obtained for different runs without background knowledge, with each rule and 
with combinations of some rules. As mentioned before, combinations of rules 
pointing to the same dummy label are treated as just one disjunctive rule, so 
that the total number of objects covered by the combined rule is the sum of the 
number of objects covered by its components. Results are the average percentage 
and standard deviation of correctly classified instances obtained by labeling each 
cluster with the label corresponding to the majority value of the class. 
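
The accuracy measure described above can be sketched as follows. The function 
name `majority_label_accuracy` and the toy data are illustrative assumptions; 
the sketch only shows the idea of labeling each cluster with its majority 
class and counting the objects that agree with that label. 

```python
from collections import Counter


def majority_label_accuracy(cluster_ids, true_labels):
    """Percentage of objects whose class equals the majority class of their cluster."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, true_labels):
        by_cluster.setdefault(cid, []).append(label)
    # Each cluster contributes the size of its most frequent class.
    correct = sum(Counter(labels).most_common(1)[0][1]
                  for labels in by_cluster.values())
    return 100.0 * correct / len(true_labels)


# Toy top-level partition with two clusters (hypothetical data).
clusters = [0, 0, 0, 1, 1, 1, 1, 0]
classes = ["p", "p", "e", "e", "e", "e", "p", "p"]
accuracy = majority_label_accuracy(clusters, classes)
```

Note that the class labels enter only in this external evaluation; they play 
no role in forming the clusters. 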

Results show that incorporating prior knowledge into the clustering process 
yields an important performance improvement. This improvement is especially 
interesting in the promoters data set, which combines a high dimensionality with 
a small number of objects, making clustering very difficult and prone to high 
variability. It is worth noting that very general rules are not necessary 
in order to provide an appropriate bias to the system. In the three data sets, 
rules covering about 15% of the objects suffice to increase the performance of the 
clustering system. Of course, more general rules provide a stronger bias, as shown 
by Rule 3 in the voting records data set. However, rules of intermediate generality 
appear to be powerful enough biases. The combination of rules 1 and 2 in the 

Table 3. Results of using declarative knowledge on different data sets 

Data set                  Rule Base   Accuracy 
------------------------  ----------  ------------- 
voting records            None        89.42 ± 2.03 
(435 objects, 16 feat.)   RULE 1      90.56 ± 0.96 
                          RULE 2      92.36 ± 0.47 
                          RULE 1+2    93.22 ± 0.48 
                          RULE 3      95.63 ± 0.00 
mushroom                  None        89.28 ± 0.25 
(1000 objects, 22 feat.)  RULE 1      89.20 ± 0.25 
                          RULE 2      92.68 ± 0.25 
                          RULE 1+2    95.35 ± 0.25 
promoters                 None        64.75 ± 11.00 
(106 objects, 57 feat.)   RULE 1      63.90 ± 8.75 
                          RULE 2      57.89 ± 7.49 
                          RULE 1+2    67.36 ± 9.22 
                          RULE 3      73.87 ± 6.96 

mushroom data set and Rule 3 in the promoters data set boost performance 
substantially and cover about 30% and 18% of the objects, respectively. On the 
other hand, not every correct rule is a good bias for the system. For example, Rule 
1 in the mushroom data set and Rules 1 and 2 applied alone in the promoters 
data set appear to decrease performance. This highlights an important issue of 
using prior knowledge in clustering: good rules are not only rules that correctly 
classify objects, but rules that appropriately modify the original biases of the 
system towards the desired results. Therefore, as a side result, the experiments 
suggest that clustering systems need to provide mechanisms for user interaction 
and comprehensible feedback in order to help users validate their knowledge. 

Besides modifying the bias of the system, we have pointed out that back- 
ground knowledge should improve comprehensibility. In a hierarchical cluster- 
ing, comprehensibility can be measured by counting the number of nodes that 
have to be considered in a classification path until achieving a classification. 
Note that testing nodes at the same level is roughly like testing disjunctions in 
a logical description, while descending into inner nodes is analogous to testing 
conjunctions. Ideally, one would like to make accurate predictions by descending 
only to a limited depth into the hierarchy and testing a limited number of nodes. 

We ran an additional experiment with the promoters data set by dividing 
the data into two disjoint subsets containing 70% and 30% of the objects, 
clustering the first subset, and using the second as a test set to predict the 
label, which is unseen during learning. We made two sorts of runs, one without 
background knowledge and another one using rule 3 from the previous experiments. 
For both, we 
recorded different accuracies by limiting the depth of the hierarchy at which 
predictions were made. These predictions are obtained from the majority value 
of the label in the reached node. Figure 3 shows the trade-off between accuracy 
and comprehensibility (measured as the number of tested nodes) obtained from 
30 independent runs. Clearly, by using background knowledge, the system is able 
to make accurate predictions at the top levels of the hierarchy, thus providing an 
easy interpretation of the domain. By contrast, the original algorithm needs to 
test a higher number of nodes in order to achieve the same accuracy and, thus, 
suggests more complex explanations. Although the original algorithm attains a 
somewhat higher accuracy than using rules by descending deeper into the 
hierarchy, this improvement is achieved at the expense of a loss of 
comprehensibility. In conclusion, the results support our claim that using 
background knowledge improves the accuracy-comprehensibility trade-off in 
clustering tasks. 

Fig. 3. Accuracy-comprehensibility trade-off with and without background 
knowledge. 

7 Related Work 

The idea of combining declarative knowledge and inductive learning is not new. 
Explanation Based Learning (EBL) systems use the background knowledge con- 
tained in a domain theory to prove that an example is a member of a class 
and deduce a new rule which classifies more efficiently new similar examples. 
Most of the approaches integrating EBL and inductive learning stress the 
deductive component, so that they still rely strongly on a considerable amount 
of background knowledge. Moreover, the inductive component is usually a 
supervised one, since EBL requires labeling of examples. In contrast, we 
approach the integration of theory and data emphasizing the inductive compo- 
nent. Prior knowledge is viewed as a bias to help a data-driven process that, in 
turn, provides a confirmation on the validity of this knowledge. 

Another paradigm that automatically integrates background knowledge into 
the learning process is Inductive Logic Programming (ILP), although there is 
only a small body of research on clustering in this area. A recent exception is P, 
although its evaluation is very limited with respect to the typical UCI data sets 
used for conceptual clustering. Moreover, agglomerative methods like ISAAC are 
common in hierarchical clustering. Since our method does not depend on any 
particular feature of the system, we think that it could be readily applied to 
any agglomerative approach, thus providing a way of integrating background 
knowledge without necessarily shifting to ILP tools. 

8 Concluding Remarks 

This work presents a methodology for integrating declarative knowledge into 
clustering tasks. We stress the use of prior knowledge as related to user interac- 
tion and advocate for clustering systems that can provide better interaction and 
feedback to the users in order to take full advantage of prior knowledge. Since in 
unsupervised settings it is likely that prior knowledge is uncertain and incom- 
plete, we address the use of this knowledge as a bias to a data-driven process, 
which has the primary role in the learning task. 

Results clearly show the benefits that we can obtain from using prior knowl- 
edge in the clustering task as regards accuracy and comprehensibility. We have 
also given a preliminary outline of the interactive use and validation of prior 
knowledge in clustering. Results highlight the important role of interaction and 
feedback in clustering in order to validate users' theories. This becomes especially 
important because not every correct rule is a good bias for the clustering system, 
so the user should be able to gain appropriate insight into clustering 
results to decide which part of their knowledge is likely to be useful. With regard to 
this subject, we have outlined the benefits of a clustering system that allows the 
user to decide the degree of generality of the levels in a cluster hierarchy coupled 
with the use of prior knowledge. Other improvements such as providing feature 
relevances as a result of clustering may result in still better feedback. 

Abstract. We previously proposed a method for the linear discrimination 
of two classes [3]. It searches for the discriminant direction 
which maximizes the distance between the projected class-conditional 
densities. It is a nonparametric method in the sense that the densities 
are estimated from the data. Since the distance between the projected 
densities is a highly nonlinear function with respect to the projected di- 
rection we maximize the objective function by an iterative optimization 
algorithm. The solution of this algorithm depends strongly on the start- 
ing point of the optimizer and the observed maximum can be merely a 
local maximum. In [3] we proposed a procedure for recursive optimization 
which searches for several local maxima of the objective function ensur- 
ing that a maximum already found will not be chosen again at a later 
stage. In this paper we refine this method. We propose a procedure which 
provides a batch mode optimization instead of the interactive optimization 
employed in [3]. By means of a simulation we compare our procedure 
with conventional optimization that starts the optimizers at random. The 
results obtained confirm the efficacy of our method. 



1 Introduction 

We discuss discriminant analysis which searches for a discriminant direction 
by maximizing the distance between the projected class-conditional densities. 
Unfortunately this distance is a highly nonlinear function with respect to the 
projected directions, and has more than one maximum. In most applications 
the optimal solution is searched for along the gradient of the objective function, 
hoping that with a good starting point the optimization procedure will converge 
to the global maximum or at least to a practical one. Some known techniques 
such as principal component analysis, Fisher discriminant analysis and their 
combination [1],[2] may be used for choosing a starting point for the optimization 
procedure. Nevertheless, the observed maximum of the objective function can 
be merely a local maximum, which is far away from the global one in some 
data structures. In [3] we proposed a method for recursive optimization which 
searches for several large local maxima of the objective function. In this paper 
we refine this method. We propose a procedure for recursive optimization which 
ensures a batch mode optimization. Optimizing in this mode we replicate the 
recursive optimization using different starting points of the optimizer and then 
choose the best solutions from the trials done. 

Section 2 describes our method for discriminant analysis [3], Section 3 presents 
our new proposal, and Sections 4 and 5 contain the results and analyses of the 
comparison based on synthetic data sets. 



2 Discriminant Analysis by Recursive Optimization 

Suppose we are given training data $(z_1, c_1), (z_2, c_2), \ldots, (z_{N_t}, c_{N_t})$ 
comprising a set $Z_t = \{z_1, z_2, \ldots, z_{N_t}\}$ of $N_t$ training observations 
in n-dimensional sample space ($z_j \in \mathbb{R}^n$, $n \ge 2$) and their associated 
class-indicator vectors $c_j$, $j = 1, 2, \ldots, N_t$. We discuss a two class 
problem and we require that $c_j$ is a two-dimensional vector 
$c_j = (c_{1j}, c_{2j})^T$ which shows that $z_j$ belongs to one of the classes 
$\omega_1$ or $\omega_2$. The components $c_{1j}$, $c_{2j}$ are defined to be one or 
zero according to the class-membership of $z_j$, i.e. $c_{1j} = 1$, $c_{2j} = 0$ for 
$z_j \in \omega_1$ and $c_{1j} = 0$, $c_{2j} = 1$ for $z_j \in \omega_2$. The 
class-indicator vectors $c_j$ imply a decomposition of the set $Z_t$ into two 
subsets corresponding to the unique classes. We denote by $N_{ti}$ the number of 
training observations in class $\omega_i$, for i = 1,2. 

Our method requires a normalization of the data, called sphering [6]. To 
achieve data sphering we perform an eigenvalue-eigenvector decomposition 
$S_z = R D R^T$ of the pooled sample covariance matrix $S_z$ estimated over the 
training set $Z_t$. Here R and D are n x n matrices; R is orthonormal and D 
diagonal. We then define the normalization matrix $A = D^{-1/2} R^T$. The matrix 
$S_z$ is assumed to be non-singular, otherwise only the eigenvectors 
corresponding to the non-zero eigenvalues must be used in the decomposition [6]. 
In the remainder of the paper, all operations are performed on the sphered 
training data $X_t = \{x_j : x_j = A(z_j - m_z),\ z_j \in Z_t,\ j = 1, 2, \ldots, N_t\}$ 
with $m_z$ the sample mean vector estimated over $Z_t$. For the sphered training 
data $X_t$ the pooled sample covariance matrix becomes the identity matrix 
$A S_z A^T = I$. 
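
The sphering step can be sketched with NumPy as follows. This is a minimal 
illustration under stated assumptions: the pooled covariance is taken to be 
the within-class scatter divided by the number of observations minus the 
number of classes, the normalization matrix is taken to be $D^{-1/2} R^T$, and 
the covariance is assumed non-singular. The function name `sphere` and the 
synthetic data are hypothetical. 

```python
import numpy as np


def sphere(Z, classes):
    """Sphere the data: return X = (Z - mean) A^T, the matrix A and the pooled covariance S."""
    Z = np.asarray(Z, dtype=float)
    classes = np.asarray(classes)
    n_obs, n_dim = Z.shape
    class_ids = np.unique(classes)

    # Pooled within-class sample covariance (assumed definition of "pooled").
    S = np.zeros((n_dim, n_dim))
    for c in class_ids:
        centered = Z[classes == c] - Z[classes == c].mean(axis=0)
        S += centered.T @ centered
    S /= n_obs - len(class_ids)

    # Eigen-decomposition S = R diag(eigvals) R^T; S is assumed non-singular.
    eigvals, R = np.linalg.eigh(S)
    A = np.diag(eigvals ** -0.5) @ R.T

    X = (Z - Z.mean(axis=0)) @ A.T
    return X, A, S


# Hypothetical correlated data: 40 observations in 3 dimensions, two classes.
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.3, 0.2]])
Z = rng.normal(size=(40, 3)) @ mix
classes = np.array([0] * 20 + [1] * 20)
X, A, S = sphere(Z, classes)
```

After this transformation the pooled covariance of the sphered data is the 
identity matrix, which is the property the remainder of the method relies on. 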

We discuss discriminant analysis carried out by a linear mapping $y = w^T x$, 
$x \in \mathbb{R}^n$, $y \in \mathbb{R}^1$, $n \ge 2$, with x an arbitrary 
n-dimensional observation, and w a direction vector. We require w to have unit 
length, and $y = w^T x$ can be interpreted geometrically as the projection of the 
observation x onto vector w in x-space (Fig. 1). 

Fig. 1. Linear mapping ($y = w^T x$) in a two-dimensional x-space. 
Class-conditional densities $p(w^T x|\omega_1)$ and $p(w^T x|\omega_2)$ along the 
vector w. 

We search for the discriminant vector w* 

$$w^* = \arg\max_{w} \{PF(w)\}$$ 

which maximizes the Patrick-Fisher (PF) distance [5] between the 
class-conditional densities along it. PF(w) denotes the PF distance along an 
arbitrary vector w 



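$$PF(w) = \left\{ \int \left[ \frac{N_{t1}}{N_t}\, p(w^T x|\omega_1) - \frac{N_{t2}}{N_t}\, p(w^T x|\omega_2) \right]^2 dx \right\}^{1/2} \qquad (1)$$ 

with 

$$p(w^T x|\omega_i) = \frac{1}{h\sqrt{2\pi}\, N_{ti}} \sum_{j=1}^{N_t} c_{ij} \exp\left\{ -\frac{1}{2h^2} \left[ w^T (x - x_j) \right]^2 \right\}, \quad i = 1,2 \qquad (2)$$ 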

the Parzen estimators with Gaussian kernels of the class-conditional densities of 
the projections $y = w^T x$. Here x is an arbitrary observation 
($x \in \mathbb{R}^n$), $c_{ij}$ is the class-indicator which constrains the 
summation in (2) to the $\omega_i$-training observations ($x_j$ corresponding to 
$c_{ij} = 1$), and h is a smoothing parameter. 
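
A numerical version of the PF distance can be sketched as follows. The sketch 
makes several assumptions that are not stated in the text: the two densities 
are weighted by the class proportions, the integral is approximated by a 
midpoint Riemann sum along the projection axis, and the smoothing parameter 
and grid size are arbitrary defaults. The function names are hypothetical. 

```python
import math


def parzen_density(y, projections, h):
    """Gaussian-kernel Parzen estimate at y from the projections of one class."""
    norm = 1.0 / (h * math.sqrt(2.0 * math.pi) * len(projections))
    return norm * sum(math.exp(-0.5 * ((y - yj) / h) ** 2) for yj in projections)


def pf_distance(w, X1, X2, h=0.5, grid_size=400):
    """Approximate Patrick-Fisher distance between two classes along direction w."""
    def project(x):
        return sum(wi * xi for wi, xi in zip(w, x))

    y1 = [project(x) for x in X1]
    y2 = [project(x) for x in X2]
    n1, n2 = len(y1), len(y2)
    prior1, prior2 = n1 / (n1 + n2), n2 / (n1 + n2)  # assumed class weights

    # Integration range: all projections padded by a few kernel widths.
    lo = min(y1 + y2) - 4.0 * h
    hi = max(y1 + y2) + 4.0 * h
    step = (hi - lo) / grid_size

    total = 0.0
    for k in range(grid_size):
        y = lo + (k + 0.5) * step
        diff = prior1 * parzen_density(y, y1, h) - prior2 * parzen_density(y, y2, h)
        total += diff * diff * step
    return math.sqrt(total)


# Hypothetical toy data: classes separated along the first coordinate only.
X1 = [(-2.0, 0.1), (-2.2, -0.1), (-1.8, 0.0)]
X2 = [(2.0, 0.0), (2.1, 0.1), (1.9, -0.1)]
pf_first = pf_distance((1.0, 0.0), X1, X2)
pf_second = pf_distance((0.0, 1.0), X1, X2)
```

In this toy example the distance is large along the first coordinate and 
vanishes along the second, where the projected classes coincide. 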

PF(w) is a nonlinear function with respect to w. In order to search for 
several large local maxima of PF(w) we have proposed a method for recursive 
maximization of PF(w) [3]. We obtain a discriminant vector w* related to a 
local maximum PF(w*) and then we transform the data along w* into data 
with greater overlap of the class-conditional densities (deflated maximum of 
PF(w) at the solution w*), and iterate to obtain a new discriminant vector. 

In our method we use the PF distance (1) because an analytical expression of 
its gradient exists [5] and is used in the iterative optimization. However, 
our method is not restricted to the PF distance. It can be applied to any 
other discriminant criterion which has several local maxima with respect to w. 
If an analytical expression of the gradient of the criterion cannot 
be obtained, we must estimate the gradient numerically. 

The main point of the method is the procedure for deflating the local 
maximum of PF(w), called Reduction of the Class Separation (RCS). In order to 
deflate PF(w) at w* (to increase class overlap along w*), we transform the 
class-conditional densities along w* to normal densities. For this purpose, we 
rotate the data applying the linear transformation 



r = Ux (3) 

with U an orthonormal (n x n) matrix. We denote the new coordinates as 
$r_1, r_2, \ldots, r_n$ ($r = (r_1, r_2, \ldots, r_n)^T$). We require that the first 
row of U is w*, which results in a rotation such that the new first coordinate 
of an observation x is $r_1 = y = (w^*)^T x$. Assume that $p(y|\omega_i)$, i = 1,2 
are the class-conditional densities of $y = (w^*)^T x$ and $m_{y|\omega_i}$, 
$\sigma^2_{y|\omega_i}$ their means and variances. We transform $p(y|\omega_i)$ to 
normal densities and leave the coordinates $r_2, r_3, \ldots, r_n$ unchanged. 
Let q be a vector function with components $q_1, q_2, \ldots, q_n$ that carries out 
this transformation: $r_1' = q_1(y)$ with $r_1'$ having normal class-conditional 
distributions and $r_i' = q_i(r_i)$, $i = 2, 3, \ldots, n$ each given by the identity 
transformations. The function $q_1$ is obtained by the percentile transformation 
method: 

- for observations x from class $\omega_1$: 

$$q_1(y) = \left[ \Phi^{-1}(F(y|\omega_1)) \right] (\sigma^2_{y|\omega_1} \pm \Delta\sigma^2)^{1/2} + (m_{y|\omega_1} - \Delta m_1); \qquad (4)$$ 

- for observations x from class $\omega_2$: 

$$q_1(y) = \left[ \Phi^{-1}(F(y|\omega_2)) \right] (\sigma^2_{y|\omega_2} \pm \Delta\sigma^2)^{1/2} + (m_{y|\omega_2} - \Delta m_2). \qquad (5)$$ 

Here, $\Delta\sigma^2$ ($0 \le \Delta\sigma^2 \le 1$), $\Delta m_1$, $\Delta m_2$ are 
user-supplied parameters, $F(y|\omega_i)$ is the class-conditional (cumulative) 
distribution function of $y = (w^*)^T x$ for i = 1,2 and $\Phi^{-1}$ is the inverse 
of the standard normal distribution function $\Phi$. Finally, 

$$x' = U^T q(Ux) \qquad (6)$$ 

transforms the class-conditional densities along w* to be normal densities 

$$p(r_1'|\omega_i) = N(m_{y|\omega_i} - \Delta m_i,\ \sigma^2_{y|\omega_i} \pm \Delta\sigma^2) \qquad (7)$$ 

leaving all directions orthogonal to w* unchanged. If we set $\Delta\sigma^2 = 0$ 
and $\Delta m_i = 0$, i = 1, 2, we make minimal changes of the data in the sense of 
the minimal relative entropy distance measure between the original and 
transformed class-conditional distributions [6, p.254] and [7, p.456]. If 
$\sigma^2_{y|\omega_i} \pm \Delta\sigma^2 = 1$ and $m_{y|\omega_i} - \Delta m_i = 0$, 
i = 1, 2, we transform the class-conditional densities along w* to N(0, 1), which 
results in full overlap of the classes along w*. This certainly eliminates the 
local maximum of the PF distance along w*, but it causes large changes of the 
distributions of the transformed data x' (6) in some applications. In order to 
direct the local optimizer to a new maximum of PF(w), and to keep the 
class-conditional densities of x' (6) as close to the densities of the original 
data x as possible, we search for the smallest values of the parameters 
$\Delta\sigma^2$, $\Delta m_1$, $\Delta m_2$ that result in a deflated PF distance 
along w*. We start our search with $\Delta\sigma^2 = 0$ and $\Delta m_i = 0$, 
i = 1, 2 (minimal changes of the data) and then we make trials increasing the 
values of $\Delta\sigma^2$ in the interval ($0 \le \Delta\sigma^2 \le 1$). We choose 
the sign (+ or -) of the change ($\pm\Delta\sigma^2$) in order to bring 
$\sigma^2_{y|\omega_i} \pm \Delta\sigma^2$ closer to 1. We set the latter value to 1 
if it crosses 1. For each $\Delta\sigma^2$ we compute the values of $\Delta m_1$ and 
$\Delta m_2$ by an expression proposed by us in [3, p.294]. 

We presented the RCS in its abstract version based on probability 
distributions. The application to observed data is accomplished by substituting 
estimates of $F(y|\omega_i)$, $m_{y|\omega_i}$ and $\sigma^2_{y|\omega_i}$ over the 
training set $X_t$. 
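
For a single class, the percentile transformation applied to observed 
projections can be sketched as below. The empirical distribution function is 
estimated from ranks with the plotting position (rank + 0.5)/n, which is an 
assumption made here to keep the values strictly inside (0, 1); the function 
name `rcs_transform_1d` and its parameters `delta_var`, `delta_mean` and 
`sign` are hypothetical stand-ins for the variance change, the mean shift and 
the chosen sign. 

```python
from statistics import NormalDist, mean, pvariance


def rcs_transform_1d(y, delta_var=0.0, delta_mean=0.0, sign=+1):
    """Map the projections y of one class to approximately normal values."""
    n = len(y)
    m, var = mean(y), pvariance(y)

    # Empirical CDF from ranks; (rank + 0.5) / n is an assumed plotting position.
    order = sorted(range(n), key=lambda j: y[j])
    cdf = [0.0] * n
    for rank, j in enumerate(order):
        cdf[j] = (rank + 0.5) / n

    target_sd = (var + sign * delta_var) ** 0.5
    inverse_normal = NormalDist().inv_cdf
    return [inverse_normal(cdf[j]) * target_sd + (m - delta_mean) for j in range(n)]


# Hypothetical skewed projections of one class.
y = [0.0, 0.1, 0.2, 5.0, 0.3]
y_normalized = rcs_transform_1d(y)
```

With zero variance change and zero mean shift, the transformed values keep 
the rank order and the mean of the original projections while their shape 
becomes approximately normal. 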

3 Batch Mode Recursive Optimization Procedure 

Here we refine our recursive optimization procedure developed in [3]. The idea 
is to ensure optimization in a batch mode instead of the optimization in an 
interactive mode proposed in [3] . For this purpose we propose a procedure which 
performs successive modification of the training data automatically (without 
man-machine interactions). In order to formalize this procedure we introduce 
the following nomenclature: 

- $X_t$ denotes the original (sphered) training data. 

- $w^*$ is the directional vector corresponding to the local maximum of the PF 
distance for the original data $X_t$. 

- $\tilde{X}_t$ denotes the training data used in the current iteration of the 
procedure. 

- $\tilde{w}$ is the directional vector corresponding to the local maximum of 
the PF distance for the current training data $\tilde{X}_t$. 

- $\tilde{X}_t'$ denotes the modified training data which has a deflated PF 
distance along $\tilde{w}$. 

- $\tilde{w}'$ is the directional vector corresponding to the local maximum of 
the PF distance for the modified training data $\tilde{X}_t'$. 

We propose the following computational procedure: 

Step 1 Starting from a directional vector we maximize the PF distance for 
the original training data $X_t$. We save the optimal solution denoted by $w^*$. 

Step 2 Initialization of the Reduction of the Class Separation (RCS): We 
initialize the current training data $\tilde{X}_t$ with the original training data 
$X_t$ ($\tilde{X}_t = X_t$) and the current optimal solution $\tilde{w}$ with the 
optimal solution $w^*$ for $X_t$ ($\tilde{w} = w^*$). We set $\Delta\sigma^2 = 0$ and 
$\Delta m_i = 0$, for i = 1,2. This setting implies minimal changes of the data 
during the RCS. 

Step 3 Running the RCS: We estimate the class-conditional means, variances 
and (cumulative) distribution functions over the projections 
$y_j = \tilde{w}^T \tilde{x}_j$ of the current training observations 
$\tilde{x}_j \in \tilde{X}_t$ onto the current optimal vector $\tilde{w}$. 
We substitute these estimates into (4) and (5), transform 
$\tilde{x}_j \in \tilde{X}_t$ using (6), and obtain the modified training data 
$\tilde{X}_t' = \{\tilde{x}_j' : \tilde{x}_j' = U^T q(U\tilde{x}_j),\ j = 1, 2, \ldots, N_t\}$ 
which has a deflated PF distance along the optimal solution $\tilde{w}$. 

Step 4 Starting from $\tilde{w}$ we maximize the PF distance for the modified 
training data $\tilde{X}_t'$. We save the optimal solution denoted by $\tilde{w}'$. 

Step 5 Starting from $\tilde{w}'$ we maximize the PF distance for the original 
training data $X_t$. We save the optimal solution $w^*$. 

Step 6 Updating the control parameters $\Delta\sigma^2$, $\Delta m_1$, $\Delta m_2$ 
and the current training data $\tilde{X}_t$: We compare the last two solutions 
$w^*$ saved in Step 5 (for the first trial, the solutions saved in Step 1 and 
Step 5). 

(a) If the last two solutions $w^*$ are equal, we increase $\Delta\sigma^2$ 
(deflate more strongly the PF distance along $\tilde{w}$) and update $\Delta m_1$, 
$\Delta m_2$ by an expression proposed by us in [3, p.294]. Our experience is that 
an increase of $\Delta\sigma^2$ with step-size 0.1 is suitable. 

(b) If the last two solutions $w^*$ are different (different local maxima of the 
PF distance have been identified) we update the current training data 
$\tilde{X}_t$ with the modified training data $\tilde{X}_t'$ 
($\tilde{X}_t = \tilde{X}_t'$), update the current direction of the RCS 
$\tilde{w}$ with the optimal solution $\tilde{w}'$ for $\tilde{X}_t'$ 
($\tilde{w} = \tilde{w}'$) and restore the initial values of the control 
parameters $\Delta\sigma^2 = 0$, $\Delta m_1 = 0$, $\Delta m_2 = 0$. 

Then we repeat Steps 3-6. We stop the iterations if several optimal solutions 
$w^*$ corresponding to different values of $PF(w^*)$ (1) are obtained. 

We replicate the proposed procedure (Steps 1-6) starting from different initial 
vectors in Step 1. We choose them by preliminary principal component and 
Fisher discriminant analysis, as we did in [3], and also at random, which is the 
usual initialization in conventional optimization. Finally we choose from the 
vectors $w^*$ saved in Step 5 those corresponding to large values of $PF(w^*)$. 
We regard the selected $w^*$ as "interesting" solutions. 
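
The control flow of Steps 1-6 can be sketched as a generic loop. This is only 
an outline under stated assumptions: the optimizer, the deflation and the 
equality check are caller-supplied hooks (`maximize`, `deflate`, `same`, all 
hypothetical names), the update of the mean shifts is omitted because its 
expression is given only in [3], and the cap of the variance change at 1 is an 
assumption. The toy hooks at the end stand in for a real objective: the 
"data" is a set of peak positions, maximization returns the nearest peak, and 
deflation removes a peak once the variance change is large enough. 

```python
def batch_recursive_optimization(X, w0, maximize, deflate, same,
                                 step=0.1, wanted=3, max_iter=50):
    """Outline of Steps 1-6; returns the distinct solutions found on the original data."""
    w_prev = maximize(X, w0)                        # Step 1
    found = [w_prev]
    X_cur, w_cur, delta_var = X, w_prev, 0.0        # Step 2
    for _ in range(max_iter):
        X_mod = deflate(X_cur, w_cur, delta_var)    # Step 3 (RCS)
        w_mod = maximize(X_mod, w_cur)              # Step 4
        w_new = maximize(X, w_mod)                  # Step 5
        if same(w_new, w_prev):                     # Step 6(a): deflate more strongly
            delta_var = min(1.0, delta_var + step)  # cap at 1 is an assumption
        else:                                       # Step 6(b): accept modified data
            X_cur, w_cur, delta_var = X_mod, w_mod, 0.0
            if not any(same(w_new, w) for w in found):
                found.append(w_new)
        w_prev = w_new
        if len(found) >= wanted:                    # stop after several distinct solutions
            break
    return found


# Toy stand-ins (hypothetical): peaks on a line instead of a real PF landscape.
def nearest_peak(peaks, start):
    return min(peaks, key=lambda p: abs(p - start))


def remove_peak(peaks, w, delta_var):
    # The peak is only removed once the deflation is strong enough.
    return peaks - {w} if delta_var >= 0.15 else peaks


solutions = batch_recursive_optimization(
    frozenset({1.0, 4.0, 9.0}), 3.5,
    maximize=nearest_peak, deflate=remove_peak, same=lambda a, b: a == b)
```

Keeping the optimizer and the deflation as separate hooks mirrors the remark 
that the approach is not tied to the PF distance: any criterion with several 
local maxima could be plugged into the same loop. 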

4 An Interactive Run of the Recursive Optimization 
Procedure 

Here we demonstrate the recursive optimization procedure in a run using 
two-dimensional synthetic data. We used samples for two classes of the sample 
sizes $N_{t1} = N_{t2} = 50$, which were drawn from two-dimensional normal mixtures: 

for class $\omega_1$: 

$$p(x_1, x_2|\omega_1) = \tfrac{1}{3} N([-1.5\ \ 0]^T, \Sigma) + \tfrac{1}{3} N([0.5\ \ {-3}]^T, \Sigma) + \tfrac{1}{3} N([-1\ \ {-3}]^T, \Sigma),$$ 

for class $\omega_2$: 

$$p(x_1, x_2|\omega_2) = \tfrac{1}{3} N([-0.5\ \ 3]^T, \Sigma) + \tfrac{1}{3} N([3\ \ 0]^T, \Sigma) + \tfrac{1}{3} N([0.5\ \ {-3}]^T, \Sigma).$$ 

Here, $N([\mu_1\ \mu_2]^T, \Sigma)$ denotes a bivariate normal density with a mean 
vector $[\mu_1\ \mu_2]^T$ and a diagonal covariance matrix 
$\Sigma = \mathrm{diag}(0.1, 0.2)$. Fig. 2 presents the original (sphered) training 
data $X_t$. 
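
Samples of this kind can be generated with a short sketch such as the one 
below. It assumes equal mixture weights for the three components of each 
class, uses the component means as printed above, and treats the diagonal 
covariance entries 0.1 and 0.2 as variances; the function name 
`sample_mixture` and the random seed are hypothetical. The result is the raw 
sample, before sphering. 

```python
import random


def sample_mixture(n, means, variances, rng):
    """Draw n points from an equal-weight mixture of axis-aligned bivariate normals."""
    std_devs = [v ** 0.5 for v in variances]
    points = []
    for _ in range(n):
        mu = rng.choice(means)  # equal component weights (assumption)
        points.append(tuple(rng.gauss(m, s) for m, s in zip(mu, std_devs)))
    return points


rng = random.Random(0)
variances = (0.1, 0.2)
class1 = sample_mixture(50, [(-1.5, 0.0), (0.5, -3.0), (-1.0, -3.0)], variances, rng)
class2 = sample_mixture(50, [(-0.5, 3.0), (3.0, 0.0), (0.5, -3.0)], variances, rng)
```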

For this data we computed the PF distances for 91 equally angled directions 
in the $(x_1, x_2)$-plane. The solid path in Fig. 3 presents PF(w) (1) for the 
vectors w directed under different angles with respect to the $x_1$-axis. We 
observe local maxima of PF(w) at angles 19°, 64°, 105°, 128° and 162°. We ran our 
procedure described in Section 3. In Step 1 we chose $w^*$ directed under 105° 
with respect to the $x_1$-axis and observed a local maximum $PF(w^*) = 0.52$ (see 
the solid path in Fig. 3). Then we ran Steps 3-6 three successive times keeping 
$\Delta\sigma^2 = 0$, $\Delta m_1 = 0$, $\Delta m_2 = 0$. 

Fig. 2. Original data $X_t$ (horizontal axis: Coordinate 1). 

Here we analyze the result obtained in the first run of Steps 3-6. In Fig. 4 
we present the transformed data $\tilde{X}_t'$ obtained in Step 3. Comparing 
$\tilde{X}_t'$ (Fig. 4) with the original data $X_t$ (Fig. 2), we observe that a 
significant class overlap was gained along the direction under 105° for the 
transformed data $\tilde{X}_t'$. This is a desired result because our goal was to 
deflate the local maximum of the PF distance at 105° in order to direct the local 
optimizer to another solution. We calculated PF(w) for $\tilde{X}_t'$ using 
different directions of w and show the resulting PF-path as the dotted path 
"..." in Fig. 3. We observe that our procedure eliminated the maximum at 105° 
and smoothed the shape of the PF distance in the range 45°-180°, causing some 
restructuring of its shape. It seems reasonable to search for other data 
transformations which cause less restructuring of the PF distance. In [4] we 
proposed a neural network implementation of the RCS which, by performing a 
highly nonlinear data transformation, decreases the restructuring of the PF 
distance, but its complexity is higher than that of the procedure proposed in 
Section 3. 

In the second and third iterations of Steps 3-6 we computed the PF distances 
for the successively transformed data $\tilde{X}_t'$. The PF-paths are shown in 
Fig. 3. The local maxima of these paths at 83°, 15° and 152° defined the starting 
points of the optimizer of the PF distance for the original data $X_t$ in Step 5. 
Using them our procedure converges to the solutions at 64°, 19° and 162° for the 
original data (see the solid path in Fig. 3). We found the two largest local 
maxima of the PF distance at 19° and 64°, which are located far away from the 
starting initialization 105° used in Step 1. The latter cannot be obtained by 
conventional optimization. 

5 A Comparative Study 

Here we compare the discrimination qualities of the discriminant vectors w* 
obtained by our recursive procedure (Section 3) and those obtained by the con- 
ventional optimization with a random initialization of the starting directional 
vectors. 

We ran experiments with observations drawn from six-dimensional distributions
$p(x|\omega_i) = p(x_1, x_2|\omega_i)\, p(x_3, x_4|\omega_i)\, p(x_5|\omega_i)\, p(x_6|\omega_i)$ for i = 1, 2. Here the
densities were constructed with the following mixtures of normal distributions:

for class $\omega_1$:

\[ p(x_1, x_2|\omega_1) = \tfrac{1}{3} N([0\ 1]^T, I) + \tfrac{1}{3} N([5\ 3]^T, I) + \tfrac{1}{3} N([0\ 6]^T, I) \]
\[ p(x_3, x_4|\omega_1) = \tfrac{1}{3} N([-3\ 0]^T, 0.01I) + \tfrac{1}{3} N([0.5\ 3]^T, 0.01I) + \tfrac{1}{3} N([-0.5\ -3]^T, 0.01I) \]

for class $\omega_2$:

\[ p(x_1, x_2|\omega_2) = \tfrac{1}{3} N([0\ 3]^T, I) + \tfrac{1}{3} N([5\ 6]^T, I) + \tfrac{1}{3} N([-5\ 6]^T, I) \]
\[ p(x_3, x_4|\omega_2) = \tfrac{1}{3} N([-0.5\ 3]^T, 0.01I) + \tfrac{1}{3} N([3\ 0]^T, 0.01I) + \tfrac{1}{3} N([0.5\ -3]^T, 0.01I) \]

and $p(x_5|\omega_i) = p(x_6|\omega_i) = N(0, 1)$. The classes were totally overlapped in the
(x5, x6)-plane, partially overlapped in the (x1, x2)-plane and totally separated
in the (x3, x4)-plane. We chose the data having several local maxima for PF(w)
(1). We observed two local maxima of PF(w) in the (x1, x2)-plane and several
local maxima in the (x3, x4)-plane, including the global maximum of PF(w).
We set Nt1 = Nt2 = 50.

We carried out 150 runs of our procedure, starting from different initial
directional vectors in Step 1. The components of the initial vectors were drawn at
random from N(0, 1).

We compared the discrimination quality of w* in Step 1 and Step 5. In 
Step 1 we carried out the conventional optimization with random initialization 
of the starting directional vector while in Step 5 we employed our recursive 
optimization. 

We evaluated the discrimination qualities of w* by the resulting values of
PF(w*) computed for the test (extra, validation) observations (500 per class).

In Steps 1, 4 and 5 we maximized PF(w) (1) by a sequential quadratic
programming method (routine E04UCF in the NAG Mathematical Library).
We set the number of major iterations of the optimization routine E04UCF to
50. A preliminary test showed this setting to be appropriate.

We set — Ami = 0 and ± = 1 for i=l,2 in (4) and (5) 

for all runs. This setting implies that the class-conditional densities of q in (4)
and (5) are N(0, 1), which results in modified training data Xt' (Step 3)
with an approximately full overlap of the classes for the previously defined
discriminant directions. This certainly eliminates the local maximum of PF(w) at
the previous solutions, but it causes a large restructuring of Xt', which is highly
unfavorable to our procedure.

We carried out three successive runs of Steps 3-6. In order not to favor our
procedure by expanding the number of iterations in the optimization, we re-ran
Step 1 with an extended number of major iterations of the optimization
routine E04UCF. We set it to 50 × 2 × N_{Steps 3-6}, with N_{Steps 3-6} the number of
repetitions of Steps 3-6 (in our experiments N_{Steps 3-6} = 3). In the comparison
we used the largest value of PF(w*) obtained in Step 1.

We studied the situations (initial directional vectors) in which the conventional
optimization failed with the value of PF(w*) smaller than 0.35 (dashed
path in Fig. 5). The solid path in Fig. 5 presents the results obtained
by our procedure. The dots at the bottom of Fig. 5 indicate the sequential number
of the iteration which yields the largest value of PF(w*) (• first, ••
second and ••• third iteration). Missing dots indicate the cases
(random initializations 45, 55, 94) for which the conventional optimization was
better than our procedure. Our recursive optimization (solid path) outperforms
the conventional optimization in Step 1 (dashed path) for most of the
initializations.

We summarize the overall shape of the PF distance over the 100 replications
by the boxplots shown in Fig. 6. The boxplot on the left presents the values of
PF(w*) for the optimal solutions w* obtained by the conventional optimization,
the central boxplot illustrates PF(w*) for w* obtained by our recursive
optimization procedure, and the boxplot on the right presents the paired difference
of the values of PF(w*) (the difference of the solid and dashed paths of Fig. 5). In
Fig. 6 the boxes show the values of the PF distances between quartiles; the lines
represent the medians of the PF distances. Whiskers go out to the extremes of
the PF distances. We observe that the values of PF(w*) of our procedure tend
to be larger than the values of PF(w*) of the conventional optimization.

Finally we calculated the average difference of the values of PF(w*) obtained
by our recursive optimization and by the conventional optimization, which
was 0.24. We evaluated the significance of this difference by the paired t-test and
obtained the 99 percent confidence interval 0.24 ± 0.08, which confirms a
significant increase of PF(w*) for w* obtained by our procedure.

Fig. 5. Random initializations in which conventional optimization failed with
the value of PF(w*) smaller than 0.35.

Fig. 6. Boxplots of the values of PF(w*) for w* obtained by the conventional
optimization and our recursive optimization.

6 Summary and Conclusion

We have discussed a method for nonparametric linear discriminant analysis
that we proposed previously in [3]. It searches for the discriminant direction
which maximizes the Patrick-Fisher (PF) distance between the projected class- 
conditional densities. Since the PF distance is a highly nonlinear function, a 
sequential search for the directions corresponding to several large local maxima 
of the PF distance has been used. 

In this paper we refine our method [3]. We ensure optimization in a batch
mode instead of the interactive mode proposed in [3]. By means
of a simulation (Section 4) we have demonstrated that our procedure succeeds in
finding large local maxima of the PF distance which are located far away from
the starting point and cannot be found by conventional optimization. The
comparative study in Section 5 shows that our procedure is more
successful than conventional optimization with random initialization.

Abstract. In many supervised classification problems there is ambiguity
about the definitions of the classes. Sometimes many alternative but
similar definitions could equally be adopted. We propose taking advan- 
tage of this by choosing that particular definition which optimises some 
additional criterion. In particular, one can choose that definition which 
leads to greatest predictive classification accuracy, so that any action 
taken on the basis of the predicted classes is most reliable. 



1 Introduction 

This paper is concerned with supervised classification problems, although the 
ideas described here may be applied more widely. In a supervised classification 
problem one has available a design set of data, containing values of variables 
describing the objects in a sample from the population of interest, as well as 
labels indicating to which of a set of classes each of the sampled objects belongs. 
The aim is to use this design set to construct a classification rule which will 
permit new objects to be assigned to classes purely on the basis of their vectors 
of measurements. For simplicity, in this paper we will restrict ourselves to the 
two class case. 

Almost without exception, the formulations of such problems assume that 
the classes are well-defined. There has been some work on problems in which the 
class assignments in the design set may be made with error, but virtually none 
on problems in which there is ambiguity, uncertainty, or confusion about what, 
precisely, is meant by the different classes. This is the problem we consider in 
this paper. 

Such problems are, in fact, surprisingly common. We conjecture that they 
may even be more prevalent than the ‘standard’ form of problem, where the 
classes are well-defined, but that researchers have tended to ignore the problems 
for practical operational reasons: one often needs a crisp classification into one 
of the possible classes so that some appropriate action can be taken (such as 
medical treatment or personnel authorisation, for example). 

In this paper we focus mainly on the subclass of such problems in which 
the true classes are described by partitioning one or more underlying variables, 
and where the uncertainty and ambiguity arises because the positions of the 
partitioning thresholds are not definitively fixed. Here are some examples:

- Student grades are often based on performance in examinations: a score of
  more than 70% may mean a student is assigned a grade A, a score between 60%
  and 70% may mean a grade B, and so on. Here the choices of 60 and 70 are not
  absolutes, but represent an arbitrary choice. It may be entirely reasonable
  to choose different values.

- When people apply for bank loans a predictive model is often built to decide
  whether they are likely to be good or bad risks. A ‘bad risk’ may
  be defined as one who is likely to become more than a certain number of
  months in arrears during the course of the loan. To produce well-defined
  classes (necessary so that an operational decision, accept or reject, can
  be made) a particular value has to be chosen for this ‘certain number’. Thus
  an individual who is likely to fall more than three months in arrears might
  be regarded as bad. However, this choice (three) is somewhat arbitrary.
  One might, with almost equal justification, have chosen two or four. Indeed,
  as external circumstances change (the economic climate, the competitive
  banking environment, etc.) one might prefer some different value to use
  in the definition of good and bad.

- More generally, still within the banking context, the good and bad classes
  may be defined in terms of several variables, for each of which a threshold
  must be chosen. Perhaps a bad bank account is one which is overdrawn by
  more than amount t1, or by an amount t2 in conjunction with a maximum
  balance during the past three months of less than t3. Here all of the
  thresholds need to be chosen to define the classes, but there is nothing absolute
  about them.

From the perspective in which the class definitions are fixed and immutable,
perhaps the most obvious strategy to tackle problems of the kind illustrated
above is to develop models which are either invariant to a range of possible
definitions of the classes or which can easily (even automatically) be adjusted
to cope with different definitions. We have explored such models in Kelly and
Hand [5], Adams et al., and Hand et al. In particular, we looked at
two classes of models. In one (Adams et al., and Hand et al.), we seek
to predict the value that an individual is likely to have for the ‘thresholding’
variables (those variables on which thresholds are imposed to define the classes),
so that these predicted values can be compared with any chosen thresholds to
produce a predicted classification. Note that the thresholds do not have to be
specified at the time the model is built, but can be chosen immediately prior
to the time at which the assignment to a class is desired. In the other class of
models (Kelly and Hand [5]), we model the probability that an individual is
likely to have each value on the thresholding variable(s), so that we can estimate
the probability that an individual will have a value greater than any chosen
threshold. Again the threshold need not be specified in advance.

Such approaches are all very well, but the intrinsic arbitrariness (within
limits, anyway) of the class definitions opens up a more radical possibility. If one
is prepared to accept that alternative choices of the threshold may be equally
legitimate, or at least that it is difficult to defend the position that one choice is
superior to another (within limits, again), then perhaps advantage can be taken
of this freedom of choice. In particular, perhaps one can choose the threshold(s),
and hence the definitions of the classes, to optimise some criterion related to
the performance of the classification rule. This idea is illustrated in Section 2.
In Section 3 we point out that the situation is rather more subtle than might at
first appear. Depending upon the measure used, one might expect performance
to appear to improve, simply because of the changing threshold. This might mean
that the apparent improvement is deceptive.



2 Taking Advantage of Uncertainty in the Class 
Definitions 

In what follows, for convenience we shall call the variables used to define the 
classes the ‘definition variables’ and those used for the prediction the ‘predictor 
variables’. Traditional supervised classification problems are asymmetric, in the 
sense that one is trying to predict the definition variable(s) from the predictor 
variables, and the issue is merely how to combine the values of the latter to 
yield a prediction. In our situation, however, the problem is more symmetric. 
Although one is still trying to predict the definition variable(s) from the predictor 
variables, one can now also choose how to combine the former so as to maximise 
the performance criterion of interest. This opens up a number of questions. For
example, as we shall see below in Section 3, the choice of performance measure is
critical, with different measures leading to qualitatively different kinds of results.

We shall suppose here that the classes are defined by imposing thresholds on 
several definition variables and combining the resulting intervals to yield classes 
— in the manner of the third example in the opening section. The example we 
will use commences with a baseline definition currently adopted by a bank (for 
commercial reasons the definition we are using is slightly different from that in 
everyday use). In this baseline definition, a bank account is ‘bad’ in a particular 
month if 

(a) the excess amount overdrawn beyond the nominal limit is greater
    than £500;

OR

(b) this excess is greater than £100 AND the maximum balance over the
    course of the month is less than £0;

OR

(c) the total credit turnover in the month is less than 10% of the month’s
    end balance.

A ‘good’ account is defined as the complement of this. 

Now, discussions with the bank show that the choices of the threshold of
£500 in (a), £100 and £0 in (b), and 10% in (c) are somewhat arbitrary. A
value of £510 or £490 in the first would be equally legitimate in the definition.
Of course, one may prefer £500 on aesthetic grounds, but that is hardly an ap- 
propriate argument in the competitive world of retail banking. Likewise, £500 
is a convenient threshold for human comprehensibility but, since all of the nu- 
merical manipulations will be done by a computer, this seems hardly relevant. 
Of course, one can go to an extreme. It is entirely possible that £5000 would 
be an inappropriate threshold in (a). On this basis, one can define intervals of 
acceptable thresholds and, more generally, an acceptable region in the space of 
the definition variables which includes all sets of threshold values which would 
lead to acceptable definitions. (This may not be the simple product of the indi- 
vidual acceptable intervals since acceptable values of one variable may change 
according to the values of another variable.) Our aim is now to choose that point 
within this acceptable region which optimises the performance measure. 

Performance measures are discussed in more detail in Section 3. Here we
simply take the Gini coefficient, a commonly used measure in retail banking
applications (Hand and Henley). This is a measure of the difference between
two distributions, taking values between 0 and 1, larger values indicating better
performance. It is defined as twice the area between the curve and the diagonal
in a ROC curve. In our context these are the distributions of the estimated
probabilities of belonging to the good class for the true good and bad classes
respectively. Our aim, then, is to examine different definitions of these classes,
calculate classification rules for them, and see how well these rules do in terms of
the Gini coefficient. We used logistic regression as the classification rule, and
ten-fold cross-validation to evaluate the Gini coefficients.
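
As a rough illustration of the definition just given, the Gini coefficient can be computed directly from two sets of scores. The sketch below is not the authors' code: it assumes that twice the area between the ROC curve and the diagonal equals 2·AUC − 1, and it estimates the AUC by comparing every good score with every bad score. The function name and the toy scores are hypothetical.

```python
def gini_coefficient(scores_good, scores_bad):
    """Gini = 2 * AUC - 1 for estimated probabilities of being 'good'.

    AUC is estimated as the fraction of (good, bad) pairs in which the
    good account receives the higher score; ties count one half.
    """
    wins = 0.0
    for g in scores_good:
        for b in scores_bad:
            if g > b:
                wins += 1.0
            elif g == b:
                wins += 0.5
    auc = wins / (len(scores_good) * len(scores_bad))
    return 2.0 * auc - 1.0


# Hypothetical scores, for illustration only.
toy_good = [0.8, 0.4]
toy_bad = [0.6, 0.2]
toy_gini = gini_coefficient(toy_good, toy_bad)
```

With perfectly separated scores the function returns 1, and with identical score distributions it returns 0, which matches the range described in the text.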

Our data set consisted of 7956 bank accounts, and the acceptable region was
defined by thresholds as follows (in fact, in this case, the acceptable region is
the product of the acceptable intervals):

(a) t1 = (excess amount overdrawn beyond the nominal limit) ∈ [200, 800]

(b) t2 = (excess amount overdrawn beyond the nominal limit) ∈ [50, 600];
    t3 = (maximum balance over the course of the month) ∈ [-150, 150]

(c) t4 = (total credit turnover in the month ÷ month’s end balance) ∈ [0.05, 0.50]

One could simply apply a maximisation routine to find the definition within 
this region which optimises the Gini coefficient. However, so as to provide us 
with some insight into the behaviour of the model we evaluated the Gini at 
each point of a grid spanning the acceptable region (producing 5880 possible 
definitions of a bad account). 
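
The grid evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the grid spacing is not given in the text, so the grid values below are hypothetical, as are the function names, the four-threshold naming (t1 to t4) and the account tuple layout (excess, maximum balance, credit turnover, end balance).

```python
from itertools import product


def is_bad(excess, max_balance, turnover, end_balance, t1, t2, t3, t4):
    """Apply the three-part 'bad account' rule with thresholds t1..t4."""
    return (
        excess > t1                              # rule (a)
        or (excess > t2 and max_balance < t3)    # rule (b)
        or turnover < t4 * end_balance           # rule (c)
    )


def definition_grid(t1_values, t2_values, t3_values, t4_values):
    """Enumerate every combination of candidate thresholds."""
    return list(product(t1_values, t2_values, t3_values, t4_values))


def bad_rate(accounts, thresholds):
    """Proportion of accounts labelled bad under one definition."""
    flags = [is_bad(*account, *thresholds) for account in accounts]
    return sum(flags) / len(flags)


# Hypothetical coarse grid inside the acceptable intervals.
grid = definition_grid(
    [200, 500, 800], [50, 100, 600], [-150, 0, 150], [0.05, 0.10, 0.50]
)

# Hypothetical accounts: (excess, max_balance, turnover, end_balance).
toy_accounts = [(600, 100, 1000, 100), (150, -10, 1000, 100),
                (0, 50, 5, 100), (0, 50, 50, 100)]
baseline_rate = bad_rate(toy_accounts, (500, 100, 0, 0.10))
```

In a full study each grid point would define the class labels, a classifier would be refitted to those labels, and its cross-validated Gini coefficient recorded.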

Table 1 shows four sample definitions and the resulting Gini coefficients. The
first definition is that given at the start of the section, the slightly modified
version of that currently used by the bank. It is clear from this table that Gini
coefficients substantially greater than that currently obtained can be achieved:
a difference of 0.05 is very important in the retail banking context, and can
translate into millions of pounds.

Figure 1 shows a histogram of the Gini coefficients of the models for all 5880
definitions we tested. It is clear that many definitions permit models yielding
Gini coefficients substantially greater than that achieved using the bank’s current
definition.

Table 1. Thresholds yielding four alternative definitions of ‘bad account’.

Definition    t1     t2     t3     t4      Gini
1             500    100      0    0.10    0.41
2             400    150    -50    0.05    0.46
3             200    100    150    0.10    0.36
4             600    400      0    0.05    0.61

Fig. 1. Histogram of the Gini coefficients for the logistic regressions applied to
the 5880 definitions. (Horizontal axis: Gini Coefficient.)



3 Is Improved Performance an Artefact? 

Figure 1 is striking, but it is rather deceptive. Figure 2 shows a scatterplot of
the proportions of the sample defined as bad by each definition against the Gini
coefficient of the classifier built based on that definition. A negative correlation is
clear: in general, a higher Gini coefficient is achieved by defining fewer accounts
as bad. Of course, this is not a rigid relationship. For a given bad rate there are
many different definitions, each associated with a different Gini coefficient; this is
shown by the fact that a horizontal line at any given bad rate is associated with
multiple classifiers and Gini coefficients. This means that there is still value in the
notion of changing the definitions in the hope of improving class predictability.
But it does mean that one has to be wary of the improvement arising simply
as an artefact of such phenomena as changing priors. In this section we look at
some aspects of such issues in more detail.

Many different measures of performance are used for assessing supervised
classification rules (see Hand for a discussion). This is because different
problems have different objectives. Some measures (the Gini coefficient is an 
example) focus on measuring the difference between the overall distributions of 
the estimated probabilities of belonging to class 0 for each of the two classes. 
Others reduce this to a comparison between summarising statistics of the two 
distributions (for example, the difference between their means, or this difference 
standardised by the average standard deviation of the two distributions). Yet 
others apply the threshold to produce the classification, and classify the objects 
into predicted classes, basing measures of performance on the resulting confusion 
matrix (for example, misclassification rate, cost-weighted misclassification rate, 
or the proportion of class 1 objects amongst those classified into class 0). 




Fig. 2. Scatterplot of the bad rates for the 5880 definitions against the Gini 
coefficients of the logistic regressions. 



Some measures (the Gini coefficient and the difference between the means,
for example) are independent of the sizes of the classes. Others, however (for
example, the misclassification rate, cost-weighted or not), are not. The
standardised difference between the means will depend on the priors if the standard
deviations of the two distributions are different, since then the larger class will
dominate the estimate of the average standard deviation. If the distributions of
the estimated probabilities are each normal for each class, and if they remain
unchanged even though the priors alter, then the misclassification rate of the Bayes
classifier (that is, the minimum achievable misclassification rate) will decrease
monotonically with the absolute difference between the priors (equal priors being
associated with maximum error rate). On the other hand, (a) the misclassification
rate obtained when fixed proportions are classified into each class, and (b)
the proportion of those classified as class 0 which are misclassified when fixed
proportions are classified into each class show the behaviour illustrated in
Figure 3. This figure is based on univariate normal distributions for each class,
both with unit standard deviation, the first with mean 0 and the second with
mean μ = 0.5, 1.0, 1.5, ..., 3.0 (with μ = 0.5 being the top curve). In this
example, 80% are classified as class 0 and the horizontal axis shows the prior for
class 1. With this as the performance measure, we see that misclassification rate
may not even be monotonic with increasing prior.

Fig. 3. Misclassification (error) rate and proportion classified into class 0 which
are misclassified when, in both cases, 80% are classified as belonging to class 0.
(Panel titles: Error Rate; Bad Rate Amongst Accepts.)
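
The fixed-proportion setting described above can be reproduced numerically with a short sketch. It rests on assumptions not stated explicitly in the text: class 0 is taken to be the distribution with mean 0, class 1 the one with mean mu, and objects with the smallest values are the ones classified as class 0. The function names are hypothetical, and the cutoff is found by simple bisection.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # unit-variance normal, as in the text


def cutoff_for_fixed_accept_rate(prior1, mu, accept_rate=0.8):
    """Find c with P(x < c) = accept_rate under the two-class mixture."""
    lo, hi = -20.0, 20.0 + mu
    for _ in range(200):  # bisection on the monotone mixture CDF
        mid = 0.5 * (lo + hi)
        mass = ((1.0 - prior1) * STD_NORMAL.cdf(mid)
                + prior1 * STD_NORMAL.cdf(mid - mu))
        if mass < accept_rate:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def fixed_rate_measures(prior1, mu, accept_rate=0.8):
    """Return (error rate, proportion of class 1 among those sent to class 0)."""
    c = cutoff_for_fixed_accept_rate(prior1, mu, accept_rate)
    class1_accepted = prior1 * STD_NORMAL.cdf(c - mu)               # class 1 labelled 0
    class0_rejected = (1.0 - prior1) * (1.0 - STD_NORMAL.cdf(c))    # class 0 labelled 1
    error_rate = class1_accepted + class0_rejected
    bad_rate_among_accepts = class1_accepted / accept_rate
    return error_rate, bad_rate_among_accepts


# Sweep the class 1 prior for one separation, e.g. mu = 1.0.
curve = [fixed_rate_measures(k / 10.0, 1.0) for k in range(11)]
```

Plotting the first component of `curve` against the prior for several values of mu gives curves of the kind described for the error rate, and the second component gives the proportion misclassified among those assigned to class 0.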

So much for behaviour as priors alone vary. Now what about behaviour as a 
threshold varies? 

To get an analytical handle on things, suppose we have a bivariate normal
population (x, y), with standardised margins and correlation ρ, and that we
partition the y distribution at some threshold t, so that we are interested in a
comparison between the distributions of x for objects whose y values are above
and below this threshold. To measure performance, we use the ratio of the squared
difference between the means to the weighted average variance within the two
classes, where the weights are their relative sizes. Then (see Hand et al.) the
squared difference between the means is

\[ \frac{\rho^2 e^{-t^2}}{2\pi\, p^2 (1-p)^2} \]

where p is the proportion of the population which has y < t, and the average
variance is

\[ 1 - \frac{\rho^2 e^{-t^2}}{2\pi\, p(1-p)} \]

so that the performance measure is

\[ \frac{\rho^2 / 2\pi}{\left[ p(1-p)\, e^{t^2/2} \right]^2 - p(1-p)\, \rho^2 / 2\pi} \]

This increases without bound as t increases. (The numerator is constant. The
second term in the denominator tends to zero as t tends to infinity because of
the relationship between p and t. The first term in the denominator is the square
of $p(1-p)e^{t^2/2}$, which is less than

\[ (1-p)\, e^{t^2/2} = \frac{e^{t^2/2}}{\sqrt{2\pi}} \int_t^\infty e^{-y^2/2}\, dy \]

which is, in turn, bounded above by

\[ \frac{1}{t\sqrt{2\pi}} \]

which tends to zero as t tends to infinity.)

That is, as the definition threshold moves so that one of the classes becomes 
smaller and smaller, so this measure of performance improves. This may explain 
the pattern observed in Figure 2.
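
The growth of this measure with the threshold can be checked numerically. The sketch below assumes a standardised bivariate normal with correlation rho, for which the squared difference between the class means is rho²φ(t)²/(p(1−p))² and the size-weighted average variance is 1 − rho²φ(t)²/(p(1−p)), with φ the standard normal density and p = Φ(t). The function name is hypothetical.

```python
from math import exp, pi
from statistics import NormalDist


def separation_measure(rho, t):
    """Squared mean difference divided by the weighted average variance."""
    p = NormalDist().cdf(t)                        # proportion with y < t
    core = rho ** 2 * exp(-t * t) / (2.0 * pi)     # rho^2 * phi(t)^2
    squared_mean_difference = core / (p * (1.0 - p)) ** 2
    average_variance = 1.0 - core / (p * (1.0 - p))
    return squared_mean_difference / average_variance


# The measure at increasingly extreme thresholds, for rho = 0.5.
values = [separation_measure(0.5, t) for t in (0.0, 1.0, 2.0, 3.0, 4.0)]
```

For a fixed correlation the values rise steadily as t moves away from zero, even though the relationship between x and y has not changed at all.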

The above example assumed a bivariate normal distribution for the joint
distribution of the estimated probability of belonging to class 0 (represented
by x) and the distribution of the variable on which a threshold was imposed
to define the classes. This meant that the conditional distribution of x had a 
mean which was a linear function of y. When this does not hold, the behaviour 
observed above may not arise. 

To illustrate we take the difference between the means of the x distributions
(the distribution of x for objects with y above the threshold, and the distribution
for objects with y below the threshold) as the measure of performance. Different
choices of the y threshold are likely to yield different values for the means of the
x distributions. The values will depend on f(y), the marginal distribution of y,
and M(x|y), the expected value of x given y. The difference between the means
is

\[ S = \frac{1}{p} \int_{-\infty}^{t} M(x|y) f(y)\, dy - \frac{1}{1-p} \int_{t}^{\infty} M(x|y) f(y)\, dy \]

where $p = \int_{-\infty}^{t} f(y)\, dy$. Since this relationship is not affected by arbitrary
monotonic increasing transformations of y, we may take that transformation on which
f(y) is uniform. For example, if the distribution of y values is normal we may
work with u = Φ(y), Φ being the cumulative normal distribution, yielding

\[ S = \frac{1}{p} \int_{0}^{\Phi(t)} M(x|\Phi^{-1}(u))\, du - \frac{1}{1-p} \int_{\Phi(t)}^{1} M(x|\Phi^{-1}(u))\, du \]

as the difference between the means. From this, it is obvious that the difference
between the means depends on M(x|Φ^{-1}(u)). Taking M(x|Φ^{-1}(u)) to be a
linear function of u results in a constant difference. On the other hand, if we take
M(x|Φ^{-1}(u)) = Φ^{-1}(u) (which arises if we take the mean of the conditional
distribution of x to be equal to y and take the distribution of y to be normal),
then the difference between the means increases as t moves towards the extremes,
as in the bivariate normal case above. Finally, suppose that M(x|Φ^{-1}(u))
decreases linearly with u up to some point u = T and is constant for u beyond
that point. This corresponds to conditional distributions f(x|y) which are the
same for y > Φ^{-1}(T). Now, if we take the threshold t greater than Φ^{-1}(T) and
increasing, the mean of the distribution for x values for objects with y greater
than t remains constant, while the mean of the other approaches it as t increases:
performance will decrease as t increases.
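
The three cases just described can be compared numerically. The sketch below assumes that y has been transformed to a uniform variable u on (0, 1), so that the threshold is expressed as the proportion p below it, and it takes the difference as the mean below the threshold minus the mean above it. The conditional-mean functions and their constants are hypothetical choices for illustration, and the integrals use a plain midpoint rule.

```python
from statistics import NormalDist


def mean_difference(m_of_u, p, steps=2000):
    """Mean of M below the threshold minus mean of M above it (u uniform)."""
    def integral(a, b):
        h = (b - a) / steps
        # Midpoint rule: never evaluates m_of_u at the endpoints 0 or 1.
        return h * sum(m_of_u(a + (i + 0.5) * h) for i in range(steps))
    return integral(0.0, p) / p - integral(p, 1.0) / (1.0 - p)


def linear_mean(u):
    """Conditional mean linear in u (hypothetical coefficients)."""
    return 2.0 - 3.0 * u


def normal_scores_mean(u):
    """Conditional mean equal to y itself when y is standard normal."""
    return NormalDist().inv_cdf(u)


FLAT_POINT = 0.4  # hypothetical value of T


def flat_tail_mean(u):
    """Decreases linearly up to u = T, then stays constant."""
    return max(FLAT_POINT - u, 0.0)


proportions = (0.5, 0.8, 0.95)
linear_case = [mean_difference(linear_mean, p) for p in proportions]
normal_case = [abs(mean_difference(normal_scores_mean, p)) for p in proportions]
flat_case = [mean_difference(flat_tail_mean, p) for p in proportions]
```

Under these assumptions `linear_case` stays constant, the magnitudes in `normal_case` grow as the threshold becomes more extreme, and `flat_case` shrinks once the threshold lies beyond the point where the conditional mean flattens.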

We see from these examples that performance can vary in different ways, de- 
pending on the underlying distributions and on the particular measure adopted. 
In particular, if the measure of performance is monotonically related to the po- 
sition of a threshold, then the chosen definition of the classes will be at the edge 
of the region of acceptable definitions. This observation can simplify the process 
of optimising the performance measure. 



4 Conclusion 

We began with the premise that in many supervised classification problems there 
is intrinsic wooliness about the definition of the classes. In some cases this arises 
because the problem is dynamic, so that important factors can change over time, 
in others it arises because the classes are defined in terms of variables which are 
proxies for the real interests, and in yet others it arises because there is no real 
sense in which one definition is substantively better or more appropriate than 
another closely related definition. In such situations we suggest that advantage 
can be taken of the looseness of the definition of the classes by choosing that 
particular definition which optimises some additional criterion. If predictive clas- 
sification accuracy is adopted as the criterion, such an approach means that one 
can have more confidence in the accuracy of one’s predictions and conclusions. 

Clearly such a strategy is not universally applicable. It is not appropriate if 
the classes are defined in a rigorous manner: for example, if a medical diagnosis 
is in terms of the definitive presence or absence of a tumour. It is only legitimate 
when there is some freedom to decide precisely what one means by the different 
classes. 

Our practical example shows that significantly improved classification
accuracy can be achieved by this method. In the banking situation, this means that
the bank could base subsequent decisions and operations on the two classes, with
greater confidence that the individual accounts really did lie in the predicted
class. However, as we demonstrated in Section 3, the improvement in predictive
classification performance may sometimes be a relatively simple consequence of
the relationship between the definition variable, on which the threshold is
imposed, and the distributions of estimated class membership probabilities.
Sometimes this will mean that the optimum of the performance measure is located at
the boundary of the region of acceptable definitions. This is, of course, precisely
the place where the ‘acceptability’ is weakest.

The example in Section 2 used classification performance (in fact, the Gini
coefficient) as the criterion which was optimised by the choice of definition (as
well as the predictive rule). However, other criteria could be used, and some- 
times it is advantageous to use measures different from that which will be used 
to measure performance accuracy. Moreover, a radically different approach to 
defining the classes would be to formulate a linear combination (instead of a 
logical combination) of the definition variables, imposing a threshold on this to 
define the classes. This might be seen as a classical statistical approach. Putting 
these two suggestions together leads to the idea of measuring predictive power 
in terms of the multiple correlation coefficient between the predictor and defini- 
tion variables. That is, we could use canonical correlations analysis to find that 
linear combination of the predictor variables, and that linear combination of the 
definition variables which are maximally correlated. This leads to a qualitatively 
different kind of definition for the classes from that used in Section 2, but one
which may make perfectly sound sense, especially if the predictor variables are
monotonically related to the perceived difference between the classes.

Abstract. This paper describes a clustering methodology for temporal
data using a hidden Markov model (HMM) representation. The proposed
method improves upon existing HMM-based clustering methods in two
ways: (i) it enables HMMs to dynamically change their model structure to
obtain a better fit model for the data during the clustering process, and (ii) it
provides an objective criterion function to automatically select the
clustering partition. The algorithm is presented in terms of four nested levels of
searches: (i) the search for the number of clusters in a partition, (ii) the
search for the structure for a fixed sized partition, (iii) the search for the
HMM structure for each cluster, and (iv) the search for the parameter
values for each HMM. Preliminary experiments with artificially generated
data demonstrate the effectiveness of the proposed methodology.



1 Introduction 

Unsupervised classification, or clustering, assumes that data is not labeled with 
class information. The goal is to create structure for data by objectively 
partitioning data into homogeneous groups where the within-group object 
similarity and the between-group object dissimilarity are optimized. Data 
categorization is achieved by analyzing and interpreting feature descriptions 
associated with each group. The technique has been used extensively by 
researchers in discovering structures from databases where domain knowledge 
is not available or incomplete [21][17]. 

In the past, the focus of clustering analysis has been on data described with 
static features [21], i.e., values of the features do not change, or the 
changes are negligible, during the observation period. In the real world, most 
systems are dynamic and are often best described by temporal features whose 
values change significantly during the observation period. Clustering data 
described with temporal features aims at profiling behavior patterns of dynamic 
systems through data partitioning and cluster interpretation. Clustering 
temporal data is inherently more complex than clustering static data. First, the 
dimensionality of the data is significantly larger in the temporal case. When 
data objects are characterized using static features, only one value is present 
for each feature. In the temporal case, each feature is associated with a 
sequence of values. Also, the complexity of cluster definition (modeling) and 
interpretation increases by orders of magnitude with dynamic data [27]. 

Time series may be considered similar if they share a similar global shape 
or some special local features, or have high correlation. We take a model-based 
approach: time series are considered similar when the models characterizing 
individual series are similar. Models are similar when the probability of data 
generated by one model given the other model is high, and vice versa. We assume 
the data has the Markov property, and may be viewed as the result of a 
probabilistic walk along a fixed set of states. When states can be defined 
directly using feature values, a Markov chain model representation may be 
appropriate. When the state definitions are not directly observable, or it is not 
feasible to define states by exhaustively enumerating feature values, they can 
be defined in terms of feature probability density functions. This corresponds 
to the hidden Markov model methodology. In this paper, we focus on temporal 
pattern generation using the hidden Markov model representation. 

A HMM is a non-deterministic stochastic finite state automaton (FSA). The 
basic structure of a HMM consists of a connected set of states, 
$S = \{S_1, S_2, \ldots, S_n\}$. 

We use first-order HMMs, where the state of a system at a particular time $t$ 
depends only on the state of the system at the previous time point, i.e., 
$P(S_t \mid S_{t-1}, S_{t-2}, \ldots, S_1) = P(S_t \mid S_{t-1})$. A HMM of $n$ 
states for data having $m$ features can be characterized in terms of three sets 
of probabilities: (i) the initial state probabilities, $\pi$ of size $n$, which 
define the probability of any state being the initial state of a series, (ii) the 
transition probability matrix, $A$ of size $n \times n$, which defines the 
probability of going from any one state to another state, and (iii) the 
emission probability matrix, $B$ of size $n \times m$, which defines the 
probability of generating feature values at any given state. We are interested 
in building HMMs for continuous temporal sequences. The emission probability 
density function (pdf) within each state is defined by a multivariate Gaussian 
distribution characterized by its mean vector, $\mu$, and covariance matrix, 
$\Sigma$. An example of a first-order continuous density HMM with 3 states is 
shown in Figure 1. The $\pi_i$s are the initial state probabilities for state 
$i$. The $a_{ij}$s are the transition probabilities from state $i$ to state $j$, 
and the $(\mu_i, \Sigma_i)$s define the pdfs for emission probabilities 
for state $i$. 
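
To make the generative reading of these three parameter sets concrete, the following is a minimal sketch of drawing one sequence from a continuous density HMM. It is only an illustration under simplifying assumptions that are not part of the paper: a single feature with univariate Gaussian emissions (means and standard deviations instead of a mean vector and covariance matrix), and made-up parameter values. The names `sample_hmm`, `pi`, `A`, `means` and `stds` are hypothetical.

```python
import random


def sample_hmm(pi, A, means, stds, length, rng):
    """Draw one univariate sequence from a Gaussian-emission HMM.

    pi[i]      : initial probability of state i
    A[i][j]    : probability of moving from state i to state j
    means/stds : per-state Gaussian emission parameters (assumed univariate)
    """

    def draw(probs):
        # Sample an index from a discrete distribution by inverting the CDF.
        u, acc = rng.random(), 0.0
        for k, p in enumerate(probs):
            acc += p
            if u < acc:
                return k
        return len(probs) - 1  # guard against rounding error

    states, values = [], []
    s = draw(pi)
    for _ in range(length):
        states.append(s)
        values.append(rng.gauss(means[s], stds[s]))  # emit from current state
        s = draw(A[s])                               # first-order transition
    return states, values


# Illustrative 3-state model; all numbers are invented for the example.
pi = [0.6, 0.3, 0.1]
A = [[0.80, 0.15, 0.05],
     [0.10, 0.80, 0.10],
     [0.05, 0.15, 0.80]]
means = [0.0, 5.0, 10.0]
stds = [1.0, 1.0, 1.0]

states, values = sample_hmm(pi, A, means, stds, 50, random.Random(0))
```

Only `values` would be visible to a clustering method; `states` is the hidden path that the model has to account for.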

There are a number of advantages in the HMM representation for our tem- 
poral pattern generation problem: 

— The hidden states of a HMM can be used to effectively model the set of 
potentially valid states of a dynamic process. While the set of states and the 
exact sequence of states that a dynamic system goes through may not be 
observed, they can be estimated based on the observable behavior of the system. 

— HMMs represent a well-defined probabilistic model. The parameters of a 
HMM can be determined in a well-defined manner, using methods such as 
maximum likelihood estimation or the maximum mutual information criterion. 

— HMMs are graphical models of underlying dynamic processes that govern 
system behavior. Graphical models may aid the interpretation task. 

Fig. 1. An Example 3-State Continuous Density HMM 



2 Proposed HMM Clustering Methodology 

Clustering using HMMs was first studied by Rabiner et al. for speech 
recognition problems. The idea has been further explored by other researchers, 
including Lee, Dermatas and Kokkinakis, Kosaka et al., and Smyth. Two main 
problems that have been identified in these works 
are: (i) no objective criterion measure is used for determining the size of the 
clustering partition, and (ii) a uniform, pre-specified HMM structure is assumed 
for the different clusters of each partition. This paper describes a HMM 
clustering methodology that tries to remedy these two problems by developing an 
objective partition criterion measure based on model mutual information, and by 
developing an explicit HMM model refinement procedure that dynamically modifies 
HMM structures during the clustering process. 

The proposed HMM clustering method can be summarized in terms of four 
levels of nested searches. From the outermost to the innermost level, the four 
searches are: the search for 

1. the number of clusters in a partition, 

2. the structure for a given partition size, 

3. the HMM structure for each cluster, and 

4. the parameters for each HMM structure. 

Starting from the innermost level of search, each of these four search steps is 
described in more detail next. 

2.1 Search Level 4: The HMM Parameters 

This step tries to find the maximum likelihood parameters for a HMM of a 
fixed size. The segmental K-means procedure is employed for this purpose. The 
model parameters are initialized using the Viterbi heuristic procedure: given 
the current model parameters, the procedure first segments sequence values into 
different states such that the likelihood of the sequences along a single, unique 
path of the model is maximized, and then state emission definitions and 
transition probabilities are estimated from the sequence value segmentations. The 
core of the segmental K-means procedure is the iterative Baum-Welch parameter 
reestimation procedure. The Baum-Welch procedure is a variation of 
the more general EM algorithm, which iterates between two steps: (i) the 
expectation step (E-step), and (ii) the maximization step (M-step). The E-step 
assumes the current parameters of the model and computes the expected values 
of the necessary statistics. The M-step uses these statistics to update the model 
parameters so as to maximize the expected likelihood of the parameters [24]. 
The procedure is implemented using the forward-backward computations. The 
Baum-Welch procedure is repeated until the difference between the likelihoods 
of two consecutive model configurations is less than a certain threshold. In 
the following experiments, the model convergence criterion is set to 1.0* 10“®. 
Like other maximum likelihood methods, this procedure may end up in local 
maximum values, especially when, in the case of a large HMM, there are a large 
number of parameters involved and the search space becomes very large and 
complex. 
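
The likelihood that drives this convergence test comes from the forward half of the forward-backward computations. The sketch below shows a scaled forward pass for a single sequence. It is not the authors' implementation: it assumes one feature with univariate Gaussian emissions, and the function names are hypothetical. Rescaling the forward variables at every time step is a standard device to avoid numerical underflow on long sequences.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def forward_log_likelihood(obs, pi, A, means, stds):
    """Return log P(obs | model) using the scaled forward algorithm."""
    n = len(pi)
    # Forward variables at t = 0: initial probability times emission density.
    alpha = [pi[i] * gaussian_pdf(obs[0], means[i], stds[i]) for i in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by emission.
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n))
                * gaussian_pdf(obs[t], means[j], stds[j])
                for j in range(n)
            ]
        scale = sum(alpha)            # P(o_t | o_1 .. o_{t-1})
        log_lik += math.log(scale)    # accumulate in log space
        alpha = [a / scale for a in alpha]
    return log_lik
```

A reestimation loop would call this function after each Baum-Welch update and stop once the change in the returned value falls below the chosen threshold.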

2.2 Search Level 3: The HMM Structure 

This step attempts to replace an existing HMM for a group of objects by a 
more accurate and refined HMM. Stolcke and Omohundro described 
a technique for inducing the structure of HMMs from data based on a general 
“model merging” strategy. The procedure starts with a very large model 
which has one state defined for each value of each time series. Successively, 
pairs of states are selectively merged until the posterior probability of the 
model stops increasing. The model that reaches the highest posterior probability 
is retained as the final model. Takami and Sagayama proposed the Successive 
State Splitting (SSS) algorithm to model context-dependent phonetic variations. 
Ostendorf and Singer further expanded the basic SSS algorithm by choosing 
the node and the candidate split at the same time based on likelihood gains. 
Casacuberta et al. proposed to derive the structure of HMMs through 
error-correcting grammatical inference techniques. 

Our HMM refinement procedure combines ideas from these past works. We 
start with an initial model configuration and incrementally grow or shrink the 
model through HMM state splitting and merging operations in order to choose a 
model of the right size. The goal is to obtain a model that can better account 
for the data, i.e., one having a higher model posterior probability. For both 
merge and split operations, we assume the Viterbi path does not change after 
each operation, that is, for the split operation, the observations that were in 
state $s$ will reside in one of the two new states, $q_0$ or $q_1$. The same is 
true for the merge operation. This assumption greatly simplifies the parameter 
estimation process for the new states. The choice of state(s) to which the split 
(merge) operation is applied depends upon the state emission probabilities. For 
the split operation, the state that has the highest variance is split. For the 
merge operation, the two states that have the closest mean vectors are 
considered for merging. Next we describe the criterion measure used to perform 
heuristic model selection during the HMM refinement procedure. 
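
The two selection heuristics can be sketched as follows. The text does not say how the variances of a multivariate state are summarized or which distance is used between mean vectors, so this sketch assumes the sum of per-feature variances and the squared Euclidean distance; both choices, and the function names, are illustrative assumptions.

```python
def state_to_split(variances):
    """Index of the state with the largest total emission variance.

    variances[i] holds the per-feature variances of state i; summing them
    (the trace of a diagonal covariance) is an assumption of this sketch.
    """
    return max(range(len(variances)), key=lambda i: sum(variances[i]))


def states_to_merge(means):
    """Pair of state indices whose mean vectors are closest.

    Closeness is measured by squared Euclidean distance (assumed).
    """
    best = None
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            d = sum((a - b) ** 2 for a, b in zip(means[i], means[j]))
            if best is None or d < best[0]:
                best = (d, i, j)
    return best[1], best[2]


# Invented emission parameters for a 3-state, 2-feature model.
example_means = [[0.0, 0.0], [5.0, 5.0], [0.5, 0.0]]
example_variances = [[1.0, 1.0], [4.0, 0.5], [2.0, 2.0]]

split_candidate = state_to_split(example_variances)
merge_candidates = states_to_merge(example_means)
```

In the refinement loop, a candidate produced this way would only be kept if the resulting model scores higher on the selection criterion.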



Marginal Likelihood Measure for HMM Model Selection. Li and Biswas 
proposed one possible HMM model selection criterion, the Posterior Probability 
of HMM (PPM), developed based on the computation of the Bayesian model 
merging criterion. One problem with the PPM criterion is that it depends 
heavily on the base values for the exponential distributions used to compute prior 
probabilities of global model structures of HMMs. 

Here, we present an alternative HMM model selection scheme. From Bayes' 
theorem, given data, $X$, and a model, $\lambda$, trained from $X$, the posterior 
probability of the model, $P(\lambda \mid X)$, is given by: 

$$P(\lambda \mid X) = \frac{P(\lambda)\, P(X \mid \lambda)}{P(X)},$$ 

where $P(X)$ and $P(\lambda)$ are prior probabilities of the data and the model 
respectively, and $P(X \mid \lambda)$ is the marginal likelihood of the data. Since 
the prior probability of the data remains unchanged for different models, for 
model comparison purposes we have 
$P(\lambda \mid X) \propto P(\lambda) P(X \mid \lambda)$. By assuming a uniform 
prior probability for different models, 
$P(\lambda \mid X) \propto P(X \mid \lambda)$. That is, the posterior probability 
of a model is directly proportional to the marginal likelihood. Therefore, the 
goal is to select the model that gives the highest marginal likelihood. 

Computing the marginal likelihood for complex models has been an active 
research area. Approaches include Monte-Carlo methods, e.g., 
Gibbs sampling methods, and various approximation methods, e.g., the 
Laplace approximation and approximation based on the Bayesian information 
criterion. It has been well documented that although the Monte-Carlo 
methods are very accurate, they are computationally inefficient, especially for 
large databases. It has also been shown that under certain regularity conditions, 
the Laplace approximation can be quite accurate, but its computation can be 
expensive, especially for its component Hessian matrix computation. Next, we 
describe two efficient approximation methods developed for marginal likelihood 
computation: (i) the Bayesian Information Criterion (BIC), and (ii) the 
Cheeseman-Stutz (CS) approximation. 



Bayesian Information Criterion. In log form, BIC computes the marginal 
likelihood of a model as: 

$$\log P(X \mid \lambda) = \log P(X \mid \lambda, \hat{\theta}_\lambda) - \frac{d}{2} \log N,$$ 

where $\hat{\theta}_\lambda$ is the Maximum Likelihood (ML) configuration of the 
model, $d$ is the dimensionality of the model parameter space and $N$ is the 
number of cases in the data. The first term in the BIC computation, 
$\log P(X \mid \lambda, \hat{\theta}_\lambda)$, is the likelihood term, 
which tends to promote larger and more detailed models of data, whereas the 
second term, $-\frac{d}{2} \log N$, is the penalty term, which favors smaller 
models having fewer parameters. BIC selects the best model for data by balancing 
these two terms. 
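
As a rough illustration of how the two terms trade off, the sketch below scores candidate model sizes with BIC. The parameter count is an assumption of this sketch (free initial and transition probabilities plus a full-covariance Gaussian per state); the paper does not spell out how $d$ is counted. The log-likelihood values are invented purely to exercise the code and are not results from the paper.

```python
import math


def hmm_num_free_params(n_states, n_features):
    """Count free parameters of a Gaussian HMM (assumed full covariances)."""
    initial = n_states - 1                        # initial probabilities sum to 1
    transitions = n_states * (n_states - 1)       # each row of A sums to 1
    means = n_states * n_features
    covariances = n_states * n_features * (n_features + 1) // 2
    return initial + transitions + means + covariances


def bic_score(log_likelihood, n_params, n_cases):
    """BIC in log form: likelihood term minus (d / 2) * log N."""
    return log_likelihood - 0.5 * n_params * math.log(n_cases)


def select_model(candidates, n_cases):
    """Return the key of the candidate with the highest BIC.

    candidates maps a model size to (log_likelihood, n_params).
    """
    return max(candidates, key=lambda k: bic_score(*candidates[k], n_cases))


# Hypothetical log-likelihoods for 2-, 5- and 10-state models on 2 features.
candidates = {
    2: (-500.0, hmm_num_free_params(2, 2)),
    5: (-400.0, hmm_num_free_params(5, 2)),
    10: (-395.0, hmm_num_free_params(10, 2)),
}
best = select_model(candidates, n_cases=100)
```

With these made-up numbers the 10-state model has the highest likelihood, but its penalty outweighs the small gain over the 5-state model, so the 5-state model is selected.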

Cheeseman-Stutz Approximation. Cheeseman and Stutz first proposed the CS 
approximation method for their Bayesian clustering system, AUTOCLASS: 

$$P(X \mid \lambda) = P(X' \mid \lambda)\, \frac{P(X \mid \lambda)}{P(X' \mid \lambda)},$$ 

where $X'$ represents complete data, i.e., data with known cluster labels. The 
first term is the complete data likelihood term. An exact computation of this 
term involves integration through all possible parameter configurations of the 
model: 

$$P(X' \mid \lambda) = \int d\theta\, P(\theta \mid \lambda)\, P(X' \mid \theta, \lambda),$$ 

where $\theta$ represents a model parameter configuration. The integration can 
be approximated by a summation over a set of local maximum parameter 
configurations, 
$\sum_{\theta \in \theta_s} P(\theta \mid \lambda) P(X' \mid \theta, \lambda)$. 
To reduce computation, we have taken this approximation further by using a 
single maximum likelihood configuration, $\hat{\theta}_\lambda \in \theta_s$, to 
approximate the summation, i.e., 
$P(X' \mid \lambda) \approx P(\hat{\theta}_\lambda \mid \lambda) P(X' \mid \hat{\theta}_\lambda, \lambda)$. 
The second term in the CS approximation is a gross adjustment term. Both its 
numerator and denominator are expanded using the BIC measure. Ignoring 
differences between the penalty terms in the numerator and the denominator, we 
obtain: 

$$\log P(X \mid \lambda) \approx \log P(\hat{\theta}_\lambda \mid \lambda) + \log P(X \mid \hat{\theta}_\lambda, \lambda),$$ 

where $X$ is the incomplete data and $P(\hat{\theta}_\lambda \mid \lambda)$ is the 
prior probability of the model parameters. We assume that the transition 
probabilities out of individual states follow a Dirichlet prior distribution, the 
feature mean values in each state are uniformly distributed, and the variances 
of each state follow Jeffreys' prior distribution. 

Apply Approximation Methods to HMM Structure Selection. We experimentally 
illustrate how BIC and CS work for HMM structure selection. An artificial data 
set of 100 data objects is generated from a pre-defined five-state HMM. Each 
data object is described using two temporal features. The length of the temporal 
sequence of each feature is 50. The same data set is modeled using HMMs of 
sizes ranging from 2 to 10. Results from BIC and CS are given in Figure 2. The 
dotted lines show the likelihoods of data modeled using HMMs of different sizes. 
The dashed lines show the penalty (Fig. 2(a)) and the parameter prior probability 
(Fig. 2(b)) for each model. The solid lines show BIC (Fig. 2(a)) and CS 
(Fig. 2(b)) as a combination of the above two terms. We observe that, as the 
size of the model increases, the model likelihood also increases and the model 
penalty and parameter prior decrease monotonically. Both BIC and CS have their 
highest value at the correct model structure, the 5-state model. 

Fig. 2. HMM model selection using marginal likelihood approximation methods: 
(a) BIC approximation (b) CS approximation 



2.3 Search Level 2: The Partition Structure 

The two most commonly used distance measures in the context of the HMM 
representation are the sequence-to-model likelihood measure and the symmetrized 
distance measure between pairwise models. We choose the sequence-to-model 
likelihood distance measure for our HMM clustering algorithm. The 
sequence-to-HMM likelihood, $P(O \mid \lambda)$, measures the probability that a 
sequence, $O$, is generated by a given model, $\lambda$. When the sequence-to-HMM 
likelihood distance measure is used for object-to-cluster assignments, it 
automatically enforces the criterion of maximizing within-group similarity. 

A K-means style clustering control structure and a depth-first binary divisive 
clustering control structure are proposed to generate partitions having different 
numbers of clusters. For each partition, the initial object-to-cluster memberships 
are determined by the sequence-to-HMM likelihood (See Section 2.2.1) distance 
measure. The objects are subsequently redistributed after HMM parameter 
reestimation and HMM model refinement have been applied in the intermediate 
clusters. For the K-means algorithm, the redistribution is global for all clusters. 
For binary hierarchical clustering, the redistribution is carried out between the 
child clusters of the current cluster. Thus the algorithm is not guaranteed to 
produce the maximally probable partition of the data set. If the goal is to have 
a single partition of data, the K-means style control structure may be used. If one 
wants to look at partitions at various levels of detail, binary divisive clustering 
may be suitable. Partitions with different numbers of clusters are compared using 
the PMI criterion measure, described next. For K-means clustering, the search 
stops when the PMI of the current partition is lower than that of the previous 
partition. For binary clustering, the search along a particular branch is terminated 
when dividing the current cluster decreases the overall PMI score. 
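
The K-means style control structure can be sketched as a loop that alternates model fitting and global redistribution. The skeleton below is only an illustration: `fit` and `score` are hypothetical callables standing in for HMM training (with refinement) and the sequence-to-HMM log-likelihood. To keep the example self-contained and runnable, the usage replaces the HMM with a toy stand-in in which a cluster "model" is the mean of its members' values and the score is a negative squared error.

```python
def kmeans_style_clustering(objects, labels, fit, score, max_iter=100):
    """Alternate model fitting and global object redistribution.

    fit(members)      -> model for one cluster (hypothetical callable)
    score(model, obj) -> goodness of obj under model, higher is better
    """
    k = max(labels) + 1
    models = []
    for _ in range(max_iter):
        # Re-fit one model per cluster from its current members.
        models = []
        for c in range(k):
            members = [o for o, l in zip(objects, labels) if l == c]
            models.append(fit(members) if members else None)
        # Reassign every object to its best-scoring cluster model.
        new_labels = []
        for o in objects:
            scores = [score(m, o) if m is not None else float("-inf")
                      for m in models]
            new_labels.append(max(range(k), key=lambda c: scores[c]))
        if new_labels == labels:   # memberships are stable: stop
            break
        labels = new_labels
    return labels, models


# Toy stand-in for an HMM: the model is the mean of all member values.
def fit_mean(members):
    return sum(sum(o) for o in members) / sum(len(o) for o in members)


def neg_squared_error(mean, obj):
    return -sum((x - mean) ** 2 for x in obj)


objects = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.2]]
labels, models = kmeans_style_clustering(
    objects, [0, 1, 0, 1], fit_mean, neg_squared_error)
```

In the binary divisive variant, the same loop would be run with two clusters on the members of the cluster currently being split.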

2.4 Search Level 1: The Number of Clusters in a Partition 

The quality of a clustering is measured in terms of its within-cluster similarity 
and between-cluster dissimilarity. A common criterion measure used by a number 
of HMM clustering schemes is the overall likelihood of data given models of the 
set of clusters. Since our distance measure does well in maximizing the 
homogeneity of objects within each cluster, we want a criterion measure that 
is good at comparing partitions in terms of their between-cluster distances. We 
use the Partition Mutual Information (PMI) measure for this task. 

From Bayes' rule, the posterior probability of a model, $\lambda_i$, trained on 
data, $O_i$, is given by: 

$$P(\lambda_i \mid O_i) = \frac{P(O_i \mid \lambda_i) P(\lambda_i)}{P(O_i)} = \frac{P(O_i \mid \lambda_i) P(\lambda_i)}{\sum_{j=1}^{J} P(O_i \mid \lambda_j) P(\lambda_j)},$$ 

where $P(\lambda_i)$ is the prior probability of data coming from cluster $i$ 
before the feature values are inspected, and $P(O_i \mid \lambda_i)$ is the 
conditional probability of displaying the feature $O_i$ given that it comes from 
cluster $i$. Let $MI_i$ represent the average mutual information between the 
observation sequence $O_i$ and the complete set of models 
$\Lambda = (\lambda_1, \ldots, \lambda_J)$: 

$$MI_i = \log P(\lambda_i \mid O_i) = \log\big(P(O_i \mid \lambda_i) P(\lambda_i)\big) - \log \sum_{j=1}^{J} P(O_i \mid \lambda_j) P(\lambda_j).$$ 

Maximizing this value is equivalent to separating the correct model $\lambda_i$ 
from all other models on the training sequence $O_i$. Then, the overall 
information of the partition with $J$ models is computed by summing over the 
mutual information of all training sequences: 

$$PMI = \frac{\sum_{j=1}^{J} \sum_{i=1}^{n_j} MI_i}{J},$$ 

where $n_j$ is the number of objects in cluster $j$, and $J$ is the total number 
of clusters in a partition. PMI is maximized when the $J$ models are the most 
separated set of models, without fragmentation. 
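
The PMI computation can be sketched directly from a table of sequence-to-model log-likelihoods. The sketch rests on two assumptions that should be checked against the original typeset formula: the cluster priors are taken to be uniform unless given, and the summed mutual information is divided by the number of clusters $J$. The function names are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def partition_mutual_information(loglik, labels, priors=None):
    """PMI of a partition.

    loglik[i][j] : log P(O_i | lambda_j) for sequence i and cluster model j
    labels[i]    : index of the cluster that sequence i belongs to
    priors[j]    : P(lambda_j); uniform if omitted (assumption)
    """
    n_clusters = len(loglik[0])
    if priors is None:
        priors = [1.0 / n_clusters] * n_clusters
    total = 0.0
    for row, own in zip(loglik, labels):
        # log(P(O_i | lambda_j) P(lambda_j)) for every model j
        joint = [ll + math.log(p) for ll, p in zip(row, priors)]
        # MI_i = log joint of own model minus log of the sum over all models
        total += joint[own] - log_sum_exp(joint)
    return total / n_clusters


# Two perfectly separated clusters versus two indistinguishable ones.
separated = partition_mutual_information(
    [[0.0, -1000.0], [-1000.0, 0.0]], [0, 1])
confused = partition_mutual_information(
    [[-5.0, -5.0], [-5.0, -5.0]], [0, 1])
```

Each $MI_i$ is at most zero, so a PMI close to zero indicates well-separated cluster models, while models that explain each other's sequences equally well pull the score down.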

Next, we show how the PMI measure is used to derive a good partition with 
the number of clusters and object-cluster membership. To better demonstrate 
the effects of PMI, we illustrate the process using the binary HMM clustering 
scheme, and we assume the correct model structure is known and fixed 
throughout the clustering process in this example. To generate data with $K$ 
clusters, first we manually create $K$ HMMs. From each of these $K$ HMMs, we 
generate $N_k$ objects, each described with $M$ temporal sequences. The length of 
each temporal sequence is $L$. The total number of data points for such a data 
set is $K \cdot N_k \cdot M \cdot L$. In these experiments, we choose $K = 4$, 
$N_k = 30$, $M = 2$, and $L = 100$. The HMM for each cluster has 5 states. 

First, the PMI criterion measure was not incorporated in the binary clustering 
tree building process. The branches of the tree are terminated either because 
there are too few objects in the node, or because the object redistribution 
process in a node ends with a one-cluster partition. The full binary clustering 
tree, as well as the PMI scores for intermediate and final partitions, are 
computed and shown in Figure 3(a). The PMI scores to the right of the tree 
indicate the quality of the current partition, which includes all nodes at the 
frontier of the current tree. For example, the PMI score for the partition having 
clusters $C_4$ and $C_{123}$ is 
0.0, and PMI score for the partition having clusters C 4 , ^C 2 , §§<^ 2 , and C 13 is 
— 1.75 * 10^. The result of this clustering process is a 7-cluster partition, with six 
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Fig. 3. The Binary HMM Clustering Tree 



fragmented clusters, i.e., cluster C 2 is fragmented into ^C 2 and |§C 2 , cluster 
C 3 is fragmented into HC 3 , and cluster Ci is fragmented into and 

||Ci. Figure 3(b) shows the binary HMM clustering tree where the PMI criterion 
measure is used for determining branch terminations. The dotted lines cut off 
branches of the search tree where the split of the parent cluster results in a de- 
crease in the PMI score. This clustering process re-discovers the correct 4-cluster 
partition. 

3 Experiment 

We generated an artificial data set from three random generative models, each 
of a different size, one with three states, one with four states, and one with five 
states. Based on each model, 50 data objects are created, each described by two 
temporal features. The length of the value sequence for each temporal feature is 
50. Figure 4 shows six example data objects from this data set. The dotted lines 
and the solid 
lines represent values of the two temporal features for each object. It is observed 
that, from the feature values, it is quite difficult to differentiate which objects 
are generated from the same model. In fact, objects (a) and (f) are generated 
from the three-state HMM, objects (b) and (e) are generated from the four-state 
HMM, and objects (c) and (d) are generated from the five-state HMM. Due to 
space limitations, detailed parameters of these three models are omitted. 

Given this data, our method successfully uncovers the correct clustering 
partition size, i.e., 3 clusters in the partition, and each data object is assigned 
to the correct cluster, i.e., the cluster whose derived model corresponds to the 
object’s generative model. Furthermore, for each cluster, our method accurately 
reconstructed the HMM with the correct model size and near perfect model 
parameter values. 

Fig. 4. Compare Data Objects Generated from Different Models 



4 Conclusion 

We presented a temporal data clustering methodology based on HMM represen- 
tation. HMMs have been used in speech recognition problems to model human 
pronunciations. Since the main objective in that study is recognition, it is not 
essential whether the true model structure is uncovered. A fixed size model 
structure can be used throughout data analyses, as long as the model structure 
is adequate in differentiating objects coming from different underlying models. 
On the other hand, in our case, HMMs are used to profile temporal behaviors 
of dynamic systems. Our ultimate objective is to characterize behavior patterns 
of dynamic systems by interpreting the HMMs induced from temporal data. 
Therefore, it is extremely important that the derived models are as close to the 
underlying models as possible. To facilitate this, we introduced a dynamic HMM 
refinement procedure to the clustering process and employed an objective 
measure, BIC, for model selection purposes. Furthermore, we have developed the 
PMI criterion measure for selecting the partition size. This allows an objective 
and automatic clustering process which can be very useful in many discovery 
tasks. 

Our next step is to apply this method to real world problems. The application 
domain we are currently studying concerns pediatric patients who have Respiratory 
Distress Syndrome (RDS) and are undergoing intensive hospital care. The goal of 
this application is to identify patient response patterns from temporal data recorded 
in the form of vital signs measured frequently throughout a patient’s stay at the 
hospital. 

Abstract. Case-based reasoning relies on the hypothesis that “similar 
problems have similar solutions,” which seems to apply, in a certain sense, 
to a large range of applications. In order to be generally applicable and 
useful for problem solving, however, this hypothesis and the corresponding 
process of case-based inference have to be formalized adequately. This 
paper provides a formalization which makes the “similarity structure” of 
a system accessible for reasoning and problem solving. A corresponding 
(constraint-based) approach to case-based inference exploits this structure 
in a way which allows for deriving a similarity-based prediction of 
the solution to a target problem in the form of a set of possible candidates 
(supplemented with a level of confidence). 



1 Introduction 

The problem solving method of case-based reasoning (Cbr) relies on the assump- 
tion that “similar problems have similar solutions,” subsequently referred to as 
the “Cbr hypothesis.” As an interesting aspect of this hypothesis we would like 
to emphasize that it implies certain structural assumptions of a system under 
consideration. These assumptions, however, are not related to the structure of 
the system directly. Rather, they concern the “similarity structure,” which can 
be seen as a derived structure or a transformation of the system structure. 

Consider, as an illustration, a simple data generating process $P$ which 
transforms input values $x \in X$ into output values $y \in Y$. In order to 
explain a set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ of observed data, statistical 
methods or machine learning algorithms typically consider some hypothesis space 
$\mathcal{H}$. For $X = \mathbb{R}^n$ and $Y = \mathbb{R}$ this space might be 
given, for instance, as the class of linear functions 
$h(x) = \alpha_1 x_1 + \ldots + \alpha_n x_n$ 
$(\alpha_1, \ldots, \alpha_n \in \mathbb{R})$. Each of these functions corresponds 
to a certain hypothesis $h \in \mathcal{H}$. As in this example, the hypotheses are 
usually related to properties (attributes) of the instances 
$(x, y) \in X \times Y$, i.e., attributes of the output value are specified 
directly as a function $y = h(x)$ of the attributes 
of input values. As opposed to this, the Cbr hypothesis postulates a certain 

* This work has been supported by a TMR research grant funded by the European 
Commission. 

relation between similarity degrees $\sigma(x, x')$ and $\sigma(y, y')$ associated 
with pairs of instances and, hence, makes structural assumptions about the 
process $P$ not at the system or instance level but at the, say, similarity level. 

Interpreted thus, the Cbr hypothesis applies (at least to a certain extent) 
to many application domains. It seems, therefore, reasonable to exploit the 
information contained in the similarity structure of a system. Clearly, a 
necessary prerequisite for this is a suitable formalization of the Cbr hypothesis 
and, hence, the process of case-based inference. So far, however, only few 
attempts have been made in this direction. In this paper, we develop a 
formalization in which we proceed from a constraint-based interpretation of the 
Cbr hypothesis, according to which the similarity of problems imposes a constraint 
on the similarity of associated solutions in the form of a lower bound. Based on 
this formalization, we propose an approach to case-based inference which seems 
to be particularly well-suited for supporting (well-structured) task types arising 
in, e.g., data analysis and problem solving. The focus of the formal model we shall 
propose is on Cbr as case-based inference (Cbi), which essentially corresponds 
to the Reuse process within the (informal) $R^4$ model of the Cbr cycle and 
emphasizes the idea of case-based reasoning as a prediction method. 

The remaining part of the paper is organized as follows: In Section 2 the basic 
framework of case-based inference is introduced, and a constraint-based 
realization of Cbi is proposed. An application to data analysis and problem solving 
in knowledge-based configuration is outlined in Section 0. Section 0 concludes the 
paper with a summary. 



2 The CBI Framework 

In this section, we shall introduce the basic Cbi framework we proceed from. 
Within this framework the primitive concept of a case is defined as a tuple con- 
sisting of a situation and a result or outcome associated with the situation. The 
meaning of a case might range from, e.g., example-category tuples in data anal- 
ysis (classification) to problem-solution pairs in optimization. We do not make 
particular assumptions concerning the characterization of situations or results. 
Generally, an attribute-value representation will be utilized, i.e., situations as 
well as results will be marked as “feature” vectors of (not necessarily numeric) 
attribute values. 

Definition 1 (Cbi set-up). A Cbi set-up is defined as a 6-tuple 

$$\Sigma = \langle \mathcal{S}, \mathcal{R}, \varphi, \sigma_{\mathcal{S}}, \sigma_{\mathcal{R}}, \mathcal{M} \rangle,$$ 

where $\mathcal{S}$ is a countable set of situations, $\mathcal{R}$ is a set of 
results, and $\varphi : \mathcal{S} \to \mathcal{R}$ assigns results to situations. 
The functions $\sigma_{\mathcal{S}} : \mathcal{S} \times \mathcal{S} \to [0,1]$ and 
$\sigma_{\mathcal{R}} : \mathcal{R} \times \mathcal{R} \to [0,1]$ define similarity 
measures over the set of situations and the set of results, respectively. 
$\mathcal{M}$ is a finite memory 
$\mathcal{M} = \{(s_1, r_1), (s_2, r_2), \ldots, (s_n, r_n)\}$ 
of cases $c = (s, \varphi(s)) \in \mathcal{S} \times \mathcal{R}$. 
$D_{\mathcal{S}}$ resp. $D_{\mathcal{R}}$ denote the sets 
$\{\sigma_{\mathcal{S}}(s, s') \mid s, s' \in \mathcal{S}\}$ resp. 
$\{\sigma_{\mathcal{R}}(\varphi(s), \varphi(s')) \mid s, s' \in \mathcal{S}\}$ 
of actually attained similarity degrees. 

Clearly, the assumption that a situation $s \in \mathcal{S}$ determines the 
associated outcome $r = \varphi(s) \in \mathcal{R}$ does not imply that the latter 
is known as soon as the situation is characterized. For example, let situations 
correspond to instances of a class of combinatorial optimization problems. 
Moreover, define the result associated with the situation as the (unique) optimal 
solution of the corresponding problem. Of course, deriving this solution from the 
description of the problem might involve a computationally complex process. In 
this connection, we refer to case-based inference as a method which supports the 
overall process of problem solving by predicting the result associated with a 
certain situation. To this end, Cbi performs according to the Cbr principle: it 
exploits experience in the form of precedent cases, to which it “applies” 
background knowledge in the form of the heuristic Cbr hypothesis. 

Definition 2 (Cbi problem). A Cbi problem is a tuple $(\Sigma, s_0)$ consisting of 
a Cbi set-up $\Sigma$ and a new situation $s_0 \in \mathcal{S}$. The task is to 
exploit the similarity structure$^1$ of $\Sigma$ in conjunction with observed cases 
in order to predict resp. characterize the result $r_0 = \varphi(s_0)$ associated 
with $s_0$. 

It should be mentioned that the task of prediction is a very general one and 
contains several task types such as, e.g., classification or diagnosis^ as special 
cases. The fact that a result is a function of a set of observable attributes (the 
situation) is the main characteristic of prediction. 



2.1 CBI as Constraint-Based Inference 

In this section, we will formalize the hypothesis of “similar situations having sim- 
ilar results.” To this end, we adopt a constraint-based interpretation, according 
to which the similarity of situations constrains the similarity of the associated 
results (at a minimum level.) 

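Definition 3 (similarity profile). For a Cbi set-up $\Sigma$, the function 
$h_\Sigma : D_{\mathcal{S}} \to [0,1]$ defined by 

$$h_\Sigma(x) := \inf_{s, s' \in \mathcal{S},\ \sigma_{\mathcal{S}}(s, s') = x} \sigma_{\mathcal{R}}(\varphi(s), \varphi(s'))$$ 

is called the similarity profile of the set-up $\Sigma$. 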

If we refer to the triple $(\mathcal{S}, \mathcal{R}, \varphi)$ as the system, then $\varphi$ can be seen as
defining the system structure (or instance structure). The similarity profile $h_\Sigma$ is the
"fingerprint" of this structure at the similarity level and (partly) defines the
similarity structure of the set-up $\Sigma$. It can also be seen as a condensed
representation of knowledge concerning the system structure $\varphi$. Indeed, the domain
and the range of $h_\Sigma$ are one-dimensional, whereas $\mathcal{S}$ and $\mathcal{R}$ will generally be of
higher dimension.



^ This term will be specified in Section mu 
Definition 4 (similarity hypothesis). A similarity hypothesis is identified
by a function $h : [0,1] \to [0,1]$ (and similarity measures $\sigma_{\mathcal{S}}$, $\sigma_{\mathcal{R}}$). The intended
meaning of the hypothesis $h$ or, more precisely, the hypothesis $(h, \sigma_{\mathcal{S}}, \sigma_{\mathcal{R}})$ is that

$$(\sigma_{\mathcal{S}}(s, s') = x) \Rightarrow (\sigma_{\mathcal{R}}(\varphi(s), \varphi(s')) \ge h(x)) \quad (1)$$

holds true for all $s, s' \in \mathcal{S}$. A hypothesis $h$ is called stronger than a hypothesis
$h'$ if $h' \le h$ and $h \not\le h'$. We say that a CBI set-up $\Sigma$ satisfies the hypothesis $h$,
or that $h$ is admissible, if $h(x) \le h_\Sigma(x)$ for all $x \in D_{\mathcal{S}}$.

A similarity hypothesis $h$ is thought of as an approximation of a similarity
profile $h_\Sigma$. Thus, it defines a quantification of the CBR hypothesis for the
set-up $\Sigma$. Since a similarity profile $h_\Sigma$ is a condensed representation of the system
structure $\varphi$, a similarity hypothesis $h$ will generally be less constraining than a
system hypothesis related to $\varphi$ directly, i.e., an approximation $\hat{\varphi} : \mathcal{S} \to \mathcal{R}$ of $\varphi$.
On the other hand, a similarity profile has a relatively simple structure which
facilitates the formulation, derivation, and adaptation of hypotheses.

Consider a CBI problem $(\Sigma, s_0)$ consisting of a set-up $\Sigma$ and a new situation
$s_0$. Moreover, suppose that $\Sigma$ satisfies the hypothesis $h$. If the memory $\mathcal{M}$
contains the situation $s_0$, i.e., if $\mathcal{M}$ contains a case $(s, r)$ such that $s = s_0$, then the
outcome $r_0 = r$ can simply be retrieved from $\mathcal{M}$. Otherwise, we can derive the
following restriction:

$$r_0 \in C_{h,\mathcal{M}}(s_0) := \bigcap_{(s,r) \in \mathcal{M}} \mathcal{N}_{h(\sigma_{\mathcal{S}}(s_0, s))}(r), \quad (2)$$

where the $\alpha$-neighborhood of a result $r \in \mathcal{R}$ is defined as the set of all outcomes
$r'$ which are at least $\alpha$-similar to $r$: $\mathcal{N}_\alpha(r) := \{r' \in \mathcal{R} \mid \sigma_{\mathcal{R}}(r, r') \ge \alpha\}$. Thus,
in connection with the constraint-based view, the task of case-based inference
can be seen as one of deriving and representing the set $C_{h,\mathcal{M}}(s_0)$ in (2) or an
approximation thereof. This may become difficult if, for instance, the definition of
the similarity $\sigma_{\mathcal{R}}$ and, hence, the derivation of a neighborhood is complicated.
The sets $\mathcal{N}_\alpha(r)$ in (2) may also become large, in which case they cannot be
represented by simply enumerating their elements.
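
The restriction (2) can be sketched in a few lines for the simple case of a small, finite set of outcomes that can be enumerated. The similarity measures, the hypothesis `h`, and the toy memory below are hypothetical choices made only for illustration; they are not taken from the paper.

```python
import math

def sigma_S(s, t):
    """Similarity of two situations (hypothetical choice)."""
    return math.exp(-0.1 * abs(s - t))

def sigma_R(r, q):
    """Similarity of two outcomes (hypothetical choice)."""
    return math.exp(-0.1 * abs(r - q))

def h(x):
    """Assumed similarity hypothesis; it is not learned from data here."""
    return x ** 1.5

def neighborhood(r, alpha, outcomes):
    """All candidate outcomes that are at least alpha-similar to r."""
    return {q for q in outcomes if sigma_R(r, q) >= alpha}

def predict(s0, memory, outcomes):
    """Intersect the neighborhoods induced by every stored case (s, r)."""
    result = set(outcomes)
    for s, r in memory:
        result &= neighborhood(r, h(sigma_S(s0, s)), outcomes)
    return result

memory = [(1, 3), (5, 6)]                      # toy cases (situation, outcome)
print(sorted(predict(2, memory, range(11))))   # prints [2, 3, 4]
```

If the new situation is already stored, its case contributes the threshold `h(1) = 1`, so the prediction collapses to the stored outcome, which matches the retrieval step described above.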

In the context of CBI it must generally be assumed that the similarity profile
$h_\Sigma$ of a CBI set-up $\Sigma$ is unknown. Consequently, we cannot guarantee the
admissibility of a certain hypothesis $h$. Nevertheless, suppose that $h$ is indeed a good
approximation of $h_\Sigma$. Then, it seems reasonable to utilize $h$ for deriving a set
$C_{h,\mathcal{M}}(s_0)$ according to (2) as an approximation of $C_{h_\Sigma,\mathcal{M}}(s_0)$ (while keeping the
hypothetical character of $h$ in mind). This situation, which reflects the heuristic
character of CBI as a problem solving method, is closely related to the aspect
of learning. In [11], we have proposed an algorithm for learning similarity
hypotheses from observed cases. It has been shown that corresponding hypotheses
induce valid predictions, i.e., set-valued approximations (2) which cover $r_0$, with
high probability. In fact, the probability of an invalid prediction can be made
arbitrarily small by increasing the size of the memory.

The overall CBI process, as introduced in this section, is illustrated in
Figure 1. (a) In a first step, the problem $(\Sigma, s_0)$ is characterized at the similarity
level by means of its similarity structure^, consisting of the similarity profile $h_\Sigma$
resp. a corresponding hypothesis $h$ and the similarity structure $z_\Sigma$ of the
(extended) memory $(\mathcal{M}, s_0)$. The latter can be thought of as the set of values
$\{\sigma_{\mathcal{S}}(s_0, s_k) \mid 1 \le k \le n\}$. In fact, $h_\Sigma$ resp. $z_\Sigma$ can be seen as the "image" of the
system $(\mathcal{S}, \mathcal{R}, \varphi)$ resp. the (extended) memory $(\mathcal{M}, s_0)$ under the transformation
defined by the similarity measures $\sigma_{\mathcal{S}}$ and $\sigma_{\mathcal{R}}$. (b) The main step of the CBI
process is then to utilize the similarity structure of the problem for constraining the
unknown outcome $r_0$ at the similarity level. The corresponding constraints $C$ are
implicit in the sense that they refer to the derived property of similarity but not
to the result directly. (c) By applying the function $\sigma_{\mathcal{R}}^{-1} : \mathcal{R} \times [0,1] \to 2^{\mathcal{R}}$, which
is inversely related to $\sigma_{\mathcal{R}}$ via $\sigma_{\mathcal{R}}^{-1}(r, \alpha) = \mathcal{N}_\alpha(r)$, to the observed outcomes $r_k$
($1 \le k \le n$), the similarity constraints $C$ are transformed into constraints on
outcomes, which are combined via (2) to a constraint $C_{h_\Sigma,\mathcal{M}}$ resp. $C_{h,\mathcal{M}}$ at the
system level.

Fig. 1. Illustration of the case-based inference process.

Of course, the more "convenient" the similarity structure of a set-up $\Sigma$ is,
the more successful CBI will be. Within our framework, we have quantified this
convenience, i.e., the degree to which the CBR hypothesis holds true for the
set-up $\Sigma$, by means of the similarity profile $h_\Sigma$. This quantification, however, may
appear rather restrictive. The existence of some "exceptional" pairs of cases, for
instance, might call for small values $h_\Sigma(x)$ of the similarity profile in order to
guarantee the validity of (1). Then, the predictions (2) which reflect the success
of the CBI process might become imprecise even though the similarity structure
of $\Sigma$ is otherwise strongly developed. In order to avoid this problem and to
exploit the similarity structure of a system more efficiently we have developed
a probabilistic generalization of the approach presented in this section. This
approach extends the definition of a CBI set-up by endowing the set $\mathcal{S}$ with a
probability measure modelling the occurrence of situations. Then, the similarity
of situations allows for deriving conclusions about the (conditional) probability
distribution of the similarity of associated solutions, represented by a probabilistic
similarity profile. A probabilistic formalization of the CBR hypothesis seems
appropriate since it emphasizes the heuristic character of CBR and is particularly
well-suited for modelling the "exception to the rule." Details concerning these
extensions, which are not discussed in this paper, can be found in inng. 
From a mathematical point of view, the decisive aspect of the inference
scheme in Fig. 1 is the fact that it is based on the analysis of transformed data
which depicts a certain relation between original observations. Considering these
observations in pairs, the original data, represented by the memory $\mathcal{M} \subset \mathcal{S} \times \mathcal{R}$,
is transformed into the set of data $\{(\sigma_{\mathcal{S}}(s, s'), \sigma_{\mathcal{R}}(r, r')) \mid (s, r), (s', r') \in \mathcal{M}\}$.
As opposed to functional relations related to the instance level, which are of
the form $\mathcal{S} \to \mathcal{R}$, the result $h$ of the analysis of this data provides information
about the relation $\sigma_{\mathcal{R}}(\varphi(s), \varphi(s'))$ between outcomes $\varphi(s)$, $\varphi(s')$, given the
relation $\sigma_{\mathcal{S}}(s, s')$ between situations $s$ and $s'$. Then, given an observation $(s, r)$ and
a new situation $s_0$ and, hence, the relation $\sigma_{\mathcal{S}}(s, s_0)$, $h$ is used for specifying
the relation $\sigma_{\mathcal{R}}(r, r_0)$ between $r$ and $r_0 = \varphi(s_0)$. Finally, the inverse
transformation $\sigma_{\mathcal{R}}^{-1}$ is used for translating the information about $r$ and $\sigma_{\mathcal{R}}(r, r_0)$ into
information about $r_0$ itself. Moreover, the combination of evidence concerning
$r_0$ becomes necessary if this kind of information has been derived from different
observations $(s_1, r_1), \ldots, (s_n, r_n)$.
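
As a rough illustration of this pairwise transformation, the following sketch maps a small memory to its set of similarity pairs and then takes, for each observed situation similarity, the smallest observed outcome similarity. This empirical lower envelope is only an illustrative stand-in and is not the learning algorithm referred to in the text; the linear similarity measures are hypothetical, and symmetric measures are assumed so that unordered pairs suffice.

```python
from itertools import combinations_with_replacement

def sigma_S(s, t):
    """Hypothetical similarity of situations on a 0..10 scale."""
    return 1 - abs(s - t) / 10

def sigma_R(r, q):
    """Hypothetical similarity of outcomes on a 0..10 scale."""
    return 1 - abs(r - q) / 10

def transform(memory):
    """Map every pair of cases to (situation similarity, outcome similarity)."""
    return [(sigma_S(s, s2), sigma_R(r, r2))
            for (s, r), (s2, r2) in combinations_with_replacement(memory, 2)]

def empirical_profile(pairs, ndigits=6):
    """Smallest observed outcome similarity per observed situation similarity."""
    profile = {}
    for x, y in pairs:
        key = round(x, ndigits)
        profile[key] = min(y, profile.get(key, 1.0))
    return profile

memory = [(0, 0), (1, 3), (2, 2)]
profile = empirical_profile(transform(memory))
for x in sorted(profile):
    print(x, round(profile[x], 6))
```

Because only observed pairs enter the minimum, such an envelope can only overestimate the true similarity profile at the observed similarity levels, which is one reason the hypothetical character of any learned hypothesis has to be kept in mind.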

In our case the relation between observations corresponds to their similarity,
the function $h$ defines an (estimated) upper bound in form of (an approximation
of) the similarity profile, and the combination of evidence is realized as the
intersection of individual predictions. This, however, does not seem compulsory.
Indeed, we might think of basing inference procedures on alternative specifications
such as, e.g., $\sigma_{\mathcal{S}}(s, s') = s - s'$ and $\sigma_{\mathcal{R}}(r, r') = r - r'$. Then, for instance, a
least squares approximation $h$ of the transformed data provides an estimation of
the difference of two outcomes, given the difference of corresponding situations.

2.2 Case-Based Approximation 

Given a hypothesis $h$ and a memory $\mathcal{M}$, (2) can be extended to a set-valued
function

$$C_{h,\mathcal{M}} : \mathcal{S} \to 2^{\mathcal{R}}, \quad s \mapsto \bigcap_{(s',r') \in \mathcal{M}} \mathcal{N}_{h(\sigma_{\mathcal{S}}(s, s'))}(r'), \quad (3)$$

which is thought of as an (outer) approximation of $\varphi$ (observe that $\varphi(s) \in
C_{h,\mathcal{M}}(s)$ for all $s \in \mathcal{S}$ if $h$ is admissible). Moreover, (3) is easily generalized such
that only the $k$ most similar cases, represented by a memory $\mathcal{M}_s \subset \mathcal{M}$, are
used in order to derive a value $C_{h,\mathcal{M}}(s)$. Thus, (3) can be seen as an interesting
set-valued version of the k-Nearest Neighbor (kNN) algorithm. As opposed
to the latter, (3) also takes the quality of the similarity structure in connection
with the prediction task into account. For instance, the function $C_{h,\mathcal{M}}$ will not
be very constraining if this structure is poorly developed, which indicates that
the application of the (original) kNN method does not seem advisable. Besides,
the above-mentioned approach [11] to learning similarity hypotheses allows for
quantifying the validity of predictions obtained from $C_{h,\mathcal{M}}$ by means of a
probability bound $\alpha$ such that $P(\varphi(y) \in C_{h,\mathcal{M}}(y)) \ge 1 - \alpha$. Thus, given a set $\mathcal{M}$
of observations and the induced hypothesis $h$, (3) does not only make a
set-valued prediction of outcomes $\varphi(s)$ available, but also a corresponding level of
confidence.
The comparison of the two algorithms also makes the difference between
reasoning at the system level and reasoning at the similarity level obvious. Namely,
the kNN algorithm applies the similarity measures directly to the instances in
order to find the most similar ones and, hence, to derive a prediction. The
similarity measures are used by more indirect means in our method, in the sense
that they define the similarity structure, which is then used (in connection
with observed cases) for constraining outcomes.

It is interesting to study the approximation capability of (3). This, however,
presupposes the system $(\mathcal{S}, \mathcal{R}, \varphi)$ to have a structure which allows us to quantify
the quality of an approximation. To this end, let us endow $\mathcal{S}$ and $\mathcal{R}$ with a metric,
i.e., let $(\mathcal{S}, d_{\mathcal{S}})$ and $(\mathcal{R}, d_{\mathcal{R}})$ be metric spaces. Clearly, a good approximation of
$\varphi$ can only be expected if the similarity measures $\sigma_{\mathcal{S}}$ and $\sigma_{\mathcal{R}}$ are somehow
compatible with the distance measures $d_{\mathcal{S}}$ and $d_{\mathcal{R}}$. We can prove the following
result.

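Proposition 1. Suppose that $\sigma_{\mathcal{S}} = f \circ d_{\mathcal{S}}$ and $\sigma_{\mathcal{R}} = g \circ d_{\mathcal{R}}$ with strictly
decreasing functions $f$ and $g$. For all $\varepsilon > 0$ suppose a finite set $\mathcal{S}' \subset \mathcal{S}$ to exist
such that $\mathcal{S} = \bigcup_{s' \in \mathcal{S}'} \{s \in \mathcal{S} \mid d_{\mathcal{S}}(s, s') < \varepsilon\}$. Moreover, for some $L > 0$,
assume the Lipschitz condition $d_{\mathcal{R}}(\varphi(s), \varphi(s')) \le L\, d_{\mathcal{S}}(s, s')$ to hold on $\mathcal{S}$. Then,
the function $\varphi$ can be approximated by (3) to any degree of accuracy in the
following sense: for all $\delta > 0$, a finite memory $\mathcal{M}$ exists such that
$\|C_{h_\Sigma,\mathcal{M}}(s)\| := \max\{d_{\mathcal{R}}(r, r') \mid r, r' \in C_{h_\Sigma,\mathcal{M}}(s)\} \le \delta$ for all $s \in \mathcal{S}$.

Proof: Let $\varepsilon > 0$ and $\mathcal{S}' \subset \mathcal{S}$ satisfy $\mathrm{card}(\mathcal{S}') < \infty$ and
$\mathcal{S} = \bigcup_{s' \in \mathcal{S}'} \{s \in \mathcal{S} \mid d_{\mathcal{S}}(s, s') < \varepsilon\}$. Moreover, define
$\mathcal{M} := \bigcup_{s' \in \mathcal{S}'} \{(s', \varphi(s'))\}$. For all $s, s' \in \mathcal{S}$ such
that $\sigma_{\mathcal{S}}(s, s') = x \in D_{\mathcal{S}}$ we have $d_{\mathcal{S}}(s, s') = f^{-1}(x)$. Thus, according to our
assumptions, $\sigma_{\mathcal{R}}(\varphi(s), \varphi(s')) \ge g(L f^{-1}(x))$, which means $h_\Sigma(x) \ge g(L f^{-1}(x))$
for all $x \in D_{\mathcal{S}}$. Now, consider some $s \in \mathcal{S}$. From the property of $\mathcal{S}'$ follows
that $\mathcal{M}$ contains a case $(s_0, r_0)$ such that $d_{\mathcal{S}}(s, s_0) < \varepsilon$. Hence,
$h_\Sigma(\sigma_{\mathcal{S}}(s, s_0)) \ge g(L f^{-1}(\sigma_{\mathcal{S}}(s, s_0))) \ge g(L\varepsilon)$, which means that
$d_{\mathcal{R}}(r_0, r') \le L\varepsilon$ for all $r' \in \mathcal{N}_{h_\Sigma(\sigma_{\mathcal{S}}(s, s_0))}(r_0)$. The result then follows from
$d_{\mathcal{R}}(r, r') \le d_{\mathcal{R}}(r, r_0) + d_{\mathcal{R}}(r_0, r')$ for all $r, r' \in \mathcal{N}_{h_\Sigma(\sigma_{\mathcal{S}}(s, s_0))}(r_0)$ and
$C_{h_\Sigma,\mathcal{M}}(s) \subset \mathcal{N}_{h_\Sigma(\sigma_{\mathcal{S}}(s, s_0))}(r_0)$. $\Box$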

3 An Application to Combinatorial Optimization 

3.1 Resource-Based Configuration 

Resource-based configuration is a special approach to knowledge-based configu- 
ration. It is based on the idea that a technical system is composed of a set of 
primitive components, each of which is characterized by some set of resources or 
functionalities it provides and some other set of resources it demands. That is, the 
relation between components is modelled in an abstract way as the exchange of 
resources H3|. In its simplest form, a configuration problem is specified as a triple 
$(A, y, p)$, where $A$ is a set of components and $y$ is an external demand of
functionalities. Each component is characterized by some integer vector $a = (a_1, \ldots, a_m)$
with the intended meaning that it offers $f_i$, i.e., the $i$th functionality, $a_i$ times
if $a_i > 0$ ($1 \le i \le m$). Likewise, the component demands this functionality $-a_i$
times if $a_i < 0$. The set of components can be written compactly in form of an
$m \times n$ integer matrix which we also refer to as $A$. The $j$th column of $A$
corresponds to the vector characterizing the $j$th component. The external demand
is also specified as a vector $y \ge 0$, and the meaning of its entries $y_i$ is the same
as for the components except for the sign. The (integer) vector $p = (p_1, \ldots, p_n)$
defines the prices of the components, i.e., using the $j$th component (once) within
a configuration causes costs of $p_j > 0$. A configuration, i.e., the composition of a
set of components, is written as a vector $x = (x_1, \ldots, x_n)$ with $x_j \ge 0$ the
number of occurrences of component $j$. A configuration $x$ is feasible if the net result
of the corresponding composition and the external demand $y$ are "balanced,"
i.e., $A \times x \ge y$. If we speak of the quality of a configuration we
always have its price in mind. Therefore, a feasible configuration (solution) $x^*$
is called optimal if it causes minimal costs. In its basic form a resource-based
configuration problem is obviously equivalent to an integer linear program.^

From an applicational point of view it seems reasonable to assume that con- 
figuration problems have to be solved repeatedly for varying demands y but a 
fixed set of components A and, hence, a fixed set of prices p. In this context, 
the tuple {A,p) is also referred to as the knowledge base, and a configuration 
problem is simply identified by the demand vector y. Obviously, this kind of 
repetitive combinatorial optimization problem is particularly interesting from a
case-based reasoning perspective ini- 

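As two concrete examples let us consider the following knowledge bases
$(A_1, p_1)$ and $(A_2, p_2)$, where "?" marks an entry that is illegible in the source:

$$A_1 = \begin{pmatrix} 1 & 1 & 0 & 0 & ? \\ 0 & 2 & -1 & 0 & 0 \\ 0 & 0 & 2 & 0 & 1 \\ 0 & 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 0 & 3 \end{pmatrix}, \quad p_1 = \begin{pmatrix} 3 \\ 2 \\ 4 \\ 1 \\ ? \end{pmatrix},$$

$$A_2 = \begin{pmatrix} 1 & 3 & 0 & -1 & ? \\ 0 & 2 & -1 & 0 & 0 \\ 0 & -1 & 2 & 0 & 1 \\ 0 & 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 0 & 3 \end{pmatrix}, \quad p_2 = \begin{pmatrix} ? \\ 1 \\ 3 \\ 1 \\ ? \end{pmatrix}.$$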

In order to obtain CBI set-ups $\Sigma_1$ and $\Sigma_2$, which define corresponding repetitive
configuration problems, we further formalize these examples within our
framework as follows:

$$\mathcal{S} := \{y = (y_1, \ldots, y_5) \mid 0 \le y_1, \ldots, y_5 \le 6\}, \quad \mathcal{R} := \mathbb{Z}_{\ge 0},$$

$$\sigma_{\mathcal{S}}(y, y') := \exp\Big(-0.1 \sum_{k=1}^{5} |y_k - y'_k|\Big), \quad \sigma_{\mathcal{R}}(r, r') := \exp(-0.1\, |r - r'|),$$

$$\varphi(y) := \min\{x \times p \mid x \in \mathbb{Z}_{\ge 0}^{n},\ A \times x \ge y\}.$$

That is, we consider demand vectors as situations, where the demand of a
single functionality is at most six.^ The result associated with a situation is the
price of the corresponding optimal configuration.
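
To make the cost function concrete, here is a minimal brute-force sketch that evaluates the minimum price over non-negative integer configurations covering a demand, together with the two exponential similarity measures. The two-component knowledge base `A`, `p` and the cap `max_count` on component counts are hypothetical and chosen only to keep the enumeration small; they are not the knowledge bases of the paper.

```python
import math
from itertools import product

def sigma_S(y, y2):
    """exp(-0.1 * sum_k |y_k - y'_k|) for two demand vectors."""
    return math.exp(-0.1 * sum(abs(a - b) for a, b in zip(y, y2)))

def sigma_R(r, r2):
    """exp(-0.1 * |r - r'|) for two prices."""
    return math.exp(-0.1 * abs(r - r2))

def optimal_cost(A, p, y, max_count=10):
    """Cheapest x >= 0 (integer, each x_j <= max_count) with A x >= y, or None."""
    m, n = len(A), len(p)
    best = None
    for x in product(range(max_count + 1), repeat=n):
        covered = all(sum(A[i][j] * x[j] for j in range(n)) >= y[i]
                      for i in range(m))
        if covered:
            cost = sum(pj * xj for pj, xj in zip(p, x))
            if best is None or cost < best:
                best = cost
    return best

# Hypothetical knowledge base: component 1 offers f1 and demands f2,
# component 2 offers two units of f2.
A = [[1, 0],
     [-1, 2]]
p = [2, 3]
print(optimal_cost(A, p, (2, 1)))   # prints 10
print(optimal_cost(A, p, (0, 0)))   # prints 0
```

Exhaustive enumeration grows exponentially with the number of components, so it only serves to generate exact cases for a toy memory; it is not meant as a practical solver.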

^ See jS] for extensions of the model under which this equivalence is lost. 

^ For the sake of simplicity we also allow for the "empty demand" $y = (0, 0, 0, 0, 0)$.
Observe that these small examples already define problem classes of size $7^5$.
3.2 Supporting Problem Solving

An obvious idea in connection with a repetitive configuration problem is to 
utilize the experience from previously solved problems in order to improve future 
problem solving. In this connection, a case-based approach seems particularly 
well-suited, since the problems under consideration are similar in the sense that 
they share the same knowledge-base. 

Resource-based configuration problems can be approached efficiently by 
means of heuristic search methods jOj. Thus, one way of utilizing case-based 
experience is that of supporting the search process. Suppose, for instance, that 
we take the initial demand and the empty configuration as a point of departure 
and that a single search decision corresponds to adding a certain component a 
to the current (partial) configuration x. Such a decision, the cost of which is 
given by the cost of a, simply reduces the current problem y (associated with 
x) to the new configuration problem y' = y — a. Of course, the efficiency of the 
search procedure crucially depends on the quality of the heuristic rules which are 
used for guiding the search process, e.g., for deciding which components to add 
or when to break off a search path. The function $C_{h,\mathcal{M}}$ defined in (3) provides
valuable information for supporting such decisions. Let us briefly comment on 
two possibilities of utilizing this function. A deeper discussion of corresponding 
approaches, however, is beyond the scope of this paper. 

Since the depth of a search tree is generally not finite, an important problem
consists of deciding when to break off a search path. The function $C_{h,\mathcal{M}}$ specifies
bounds on the costs of the configuration problems (supplemented with levels
of confidence), which can be used in various ways for supporting this decision
problem. The value $C_{h,\mathcal{M}}(y)$, for instance, defines a (heuristic) lower bound $l(y)$
and a corresponding upper bound $u(y)$ for the cost of the original configuration
problem $y$ (associated with the root of the search tree). If the cost of the current
frontier node exceeds $u(y)$, it seems likely that the corresponding path is not
optimal. This argumentation also applies to all subtrees and, hence, can be used
for guiding a generalized backtracking or an iterative deepening algorithm.

Another way of utilizing the cost bounds specified by $C_{h,\mathcal{M}}$ is to support the
decision of which component to add next. If $x$ and $y$ denote the current
configuration and the original demand, respectively, then $x \times p + p_k + C_{h,\mathcal{M}}(y')$, where $y'$
denotes the demand that remains after adding the $k$th component to $x$,
defines bounds on the optimal solution associated with the decision of adding the
$k$th component to $x$, i.e., on the optimal solution located in the corresponding
subtree. These bounds can be used for selecting the most promising component.
More generally, $C_{h,\mathcal{M}}(y)$ can be combined with a heuristic estimation $\hat{\varphi}(y)$ of a
cost value $\varphi(y)$, where the function $\hat{\varphi}$ is an approximation of the cost function
$\varphi : \mathcal{S} \to \mathcal{R}$ (which maps demands to the cost of optimal solutions 0|-). This
approach to improving the accuracy of predictions is an interesting example for
combining information provided by reasoning at the system level (represented
by $\hat{\varphi}$) and reasoning at the similarity level. Let us elaborate on these ideas more
closely.
3.3 Supporting Data Analysis

An obvious question in connection with the idea of basing search decisions on 
set- valued predictions of cost values concerns the quality of such predictions. 
We have quantified the latter for the first configuration problem by means of
the expected width of an interval $C_{h_{\mathcal{M}},\mathcal{M}}(y)$ and the probability of an invalid
prediction $C_{h_{\mathcal{M}},\mathcal{M}}(y) \not\ni \varphi(y)$, where $\mathcal{M}$ is chosen at random and $h_{\mathcal{M}}$ is
derived from $\mathcal{M}$ according to the algorithm proposed in [11]. Figure 2 shows these
values, which have been obtained by means of experimental studies, as a
function of the size of the memory. (Please note the different scaling of the two
x-axes.) As can be seen, the probability of an invalid prediction quickly
converges toward 0. The non-monotonicity of the expected precision of predictions
is caused by two opposite effects which occur in connection with the derivation
of predictions from a memory $\mathcal{M}$ and the induced hypothesis $h_{\mathcal{M}}$. Observe that
$(h \le h') \Rightarrow (C_{h',\mathcal{M}}(y) \subset C_{h,\mathcal{M}}(y))$ and $(\mathcal{M}' \subset \mathcal{M}) \Rightarrow (C_{h,\mathcal{M}}(y) \subset C_{h,\mathcal{M}'}(y))$ for
all hypotheses $h, h'$, memories $\mathcal{M}, \mathcal{M}'$, and $y \in \mathcal{S}$. The above-mentioned effect
is then explained by the fact that $\mathcal{M}' \subset \mathcal{M}$ implies $h_{\mathcal{M}} \le h_{\mathcal{M}'}$ according to the
approach in [11].





Fig. 2. Left: Expected width of a set-valued prediction Ch,M{y) as a 
function of the size of the memory. Right: Expected probability of an 
invalid prediction. 



Now, let us come back to the idea of combining a case-based approximation
$C_{h,\mathcal{M}}$ and an approximation $\hat{\varphi}$ of a cost function. More specifically, suppose
$\hat{\varphi}$ to be defined as $\hat{\varphi}(y) = [\alpha_1 y_1 + \cdots + \alpha_m y_m]$, where $[\cdot] : \mathbb{R} \to \mathbb{Z}$ maps real
numbers to closest integer values. Given a set of observations in form of a memory
$\mathcal{M}$, the coefficients $\alpha_1, \ldots, \alpha_m$ can be determined by means of a least squares
approximation. Moreover, an estimation $\mu_e$ of the distribution of the residuals
$e = \varphi(y) - \hat{\varphi}(y)$ for the complete problem class $\mathcal{S}$ can be derived from frequency
information provided by the set $\{r - \hat{\varphi}(s) \mid (s, r) \in \mathcal{M}\}$ of approximation errors.
Thus, $\mu_e(e)$ is an estimation of the probability that $\varphi(y) - \hat{\varphi}(y) = e$ (if $y$ is
chosen from $\mathcal{S}$ at random). Then, given a certain demand $y$, the value $\mu_y(r) =
\mu_e(r - \hat{\varphi}(y))$ can be considered as the probability that $\varphi(y) = r$.

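Using the method proposed in [11] we can also derive a similarity hypothesis
$h_{\mathcal{M}}$ and, hence, a case-based approximation $C_{h_{\mathcal{M}},\mathcal{M}}$ together with a confidence
level $\alpha$ from the memory $\mathcal{M}$. The combination of predictions derived from $\hat{\varphi}$ and
$C_{h_{\mathcal{M}},\mathcal{M}}$ can be achieved, for instance, by applying Jeffrey's rule to $\mu_y$ and the
(uncertain) event $C_{h_{\mathcal{M}},\mathcal{M}}(y)$.^ This leads to the revised probability measure

$$\mu'_y = (1 - \alpha) \cdot \mu_y(\cdot \mid C_{h_{\mathcal{M}},\mathcal{M}}(y)) + \alpha \cdot \mu_y(\cdot \mid \mathcal{R} \setminus C_{h_{\mathcal{M}},\mathcal{M}}(y)), \quad (4)$$

where $\mu_y(\cdot \mid A)$ denotes the measure $\mu_y$ conditioned on $A \subset \mathcal{R}$, $\mu_y(C_{h_{\mathcal{M}},\mathcal{M}}(y)) > 0$,
and $\mu_y(\mathcal{R} \setminus C_{h_{\mathcal{M}},\mathcal{M}}(y)) > 0$. We might even think of replacing (4) by the simple
conditional measure $\mu_y(\cdot \mid C_{h_{\mathcal{M}},\mathcal{M}}(y))$ if $\mathcal{M}$ is sufficiently large and, hence, $\alpha$ is
close to 0. Experimental results^ for the set-ups $\Sigma_1$ and $\Sigma_2$ clearly indicate that
a combination of the two information sources via (4) improves the accuracy of
predictions.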
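
The revision by Jeffrey's rule over the two-element partition (prediction set and its complement) can be sketched as follows for a discrete distribution stored as a dictionary. The function name `jeffrey_combine`, the example distribution, the prediction set, and the value of `alpha` are all hypothetical.

```python
def jeffrey_combine(mu, C, alpha):
    """Rescale mu so that C carries mass 1 - alpha and its complement alpha."""
    mass_in = sum(prob for r, prob in mu.items() if r in C)
    mass_out = sum(prob for r, prob in mu.items() if r not in C)
    if mass_in <= 0 or mass_out <= 0:
        # Both conditional measures must exist, as required in the text.
        raise ValueError("C and its complement both need positive probability")
    return {r: (1 - alpha) * prob / mass_in if r in C else alpha * prob / mass_out
            for r, prob in mu.items()}

mu_y = {8: 0.2, 9: 0.3, 10: 0.4, 11: 0.1}   # assumed distribution over cost values
C = {9, 10}                                  # assumed case-based prediction
revised = jeffrey_combine(mu_y, C, alpha=0.1)
print({r: round(prob, 4) for r, prob in revised.items()})
```

Within the prediction set and within its complement the relative proportions of the original measure are preserved; only the total mass assigned to each side changes.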



4 Summary 

We have developed a formal approach to similarity-based reasoning which al- 
lows for deriving (set- valued) predictions of unknown outcomes (solutions). It 
has been argued that such predictions can be utilized for supporting, e.g., data 
analysis or optimization. The following points deserve mentioning: 

• We have introduced a formal framework in which the task of case-based
inference has been defined as one of predicting resp. characterizing the outcome
associated with a new situation. The distinction between reasoning at the system
level and reasoning at the similarity level has been emphasized.

• We have adopted a constraint-based view of CBI, according to which the CBR
hypothesis imposes constraints on the relation between the similarity of
situations and the similarity of corresponding outcomes.

• The concept of a similarity profile establishes a connection between the system
level and the similarity level and (partly) represents the similarity structure of
a CBI set-up. A similarity hypothesis defines an approximation of a similarity
profile and can be seen as a quantification of the CBR hypothesis. This concept
allows for realizing CBI in form of a constraint-based inference scheme.

The uncertain information that $\varphi(y) \in C_{h_{\mathcal{M}},\mathcal{M}}(y)$ with a probability of (at least)
$1 - \alpha$ (and $\varphi(y) \in \mathcal{R} \setminus C_{h_{\mathcal{M}},\mathcal{M}}(y)$ with a probability of (at most) $\alpha$) is treated here as
an unreliable observation, whereas, according to the usual interpretation of Jeffrey's
rule, uncertain inputs are considered as constraints on the revised probability
measure (which entails a certain dissymmetry between the role of the two information
sources). In fact, (4) should be seen as implementing the idea of average focusing
which coincides in our case with Jeffrey's rule since $C_{h_{\mathcal{M}},\mathcal{M}}(y)$ and $\mathcal{R} \setminus C_{h_{\mathcal{M}},\mathcal{M}}(y)$
define a partition of $\mathcal{R}$.

® These results are omitted here due to reasons of space. 
• The proposed prediction method also takes the quality of the similarity
structure into account (as opposed to, e.g., the k-Nearest Neighbor algorithm).
In particular, it is possible to provide a confidence level for the validity of
predictions. The usefulness of this approach has been illustrated by means of an
example from the field of knowledge-based configuration.
Abstract. In partitional fuzzy clustering, each cluster is characterized
by two items: its centroid and its membership function, that are usually 
interconnected through distances between centroids and entities (as in 
fuzzy c-means). 

We propose a different framework for partitional fuzzy clustering which 
suggests a model of how the data are generated from a cluster structure 
to be identified. In the model, we assume that the membership of each 
entity to a cluster expresses a part of the cluster prototype reflected in 
the entity. Due to many restrictions imposed, the model as is leads to 
removing of unneeded cluster prototypes and, thus, can serve as an index 
of the number of clusters present in data. 

A comparative experimental study of the method fitting the model, its 
relaxed version and the fuzzy c-means algorithm has been undertaken. In 
general, the study suggests that our methods can be considered a model- 
based parallel to the fuzzy c-means approach. Moreover, our generic ver- 
sion can be viewed as a device for revealing “the natural cluster struc- 
ture” hidden in data. 



1 Introduction 

In hard partitional clustering m. each entity belongs to only one cluster, and 
thus, the membership functions are zero-one vectors. In fuzzy clustering, the 
condition of exclusive belongingness for entities is relaxed, and the membership 
becomes fuzzy expressing the degree of membership of an entity to a cluster. 
Cluster prototypes are usually defined as weighted averages of the corresponding 
entities. The most known example of this approach is the fuzzy c-means method 
initially proposed by Dunn |3| and generalized by Bezdek 0SE1- 

Usually, membership functions are defined based on a distance function, such 
that membership degrees express proximities of entities to cluster centers. Even 
though the Euclidean distance is usually chosen, as in the original fuzzy c-means, 
other distance functions like $l_1$ and $l_\infty$ (belonging to the family of Minkowski
distances) and Mahalanobis distances, have been applied in partitional fuzzy 
clustering (see |HE|)- These approaches, typically, fail to explicitly describe how 
the fuzzy cluster structure relates to the data from which it is derived. 

The present work proposes a framework for fuzzy clustering based on a model 
of how the data is generated from a cluster structure to be identified. The un- 
derlying fuzzy c partition is supposed to be defined in such a way that the 
membership of an entity to a cluster expresses a part of the cluster's prototype
reflected in the entity. This way, an entity may bear 60% of a prototype A
and 40% of prototype B, which simultaneously expresses the entity’s member- 
ship to the respective clusters. The prototypes are considered as offered by the 
knowledge domain. This idea can be implemented into a formal model differ- 
ently, depending on the assumed relation between observed data and underlying 
prototypes. A seemingly most natural assumption is that any observed entity 
point is just a convex combination of the prototypes and the coefficients are the 
entity membership values. This approach was developed by Mirkin and Satarov 
as the so-called ideal type fuzzy clustering model |S] (see also ESI). It appears, 
prototypes found with the ideal types model are extremes or even outsiders with 
regard to the “cloud” of points constituting the data, which makes the ideal type 
model very much different from the other fuzzy clustering techniques: the proto- 
types found with the other methods tend to be centroids rather than extremes, 
in the corresponding clusters. 

Thus, we consider here a different way for pertaining observed entities to 
the prototypes: any entity may independently relate to any prototype, which 
is similar to the assumption in fuzzy c-means criterion. This approach can be 
considered as an intermediate between the fuzzy c-means clustering and ideal 
type fuzzy clustering. It takes the adherence to the centroids from fuzzy c-means, 
but it considers the membership as a multiplicative factor to the prototype in a 
manner similar to that of the ideal type fuzzy clustering. The model is referred 
to as the Fuzzy Clustering Multiple Prototype (FCMP) model. 

The paper is organized as follows. Section 2 introduces fuzzy partitional 
clustering with the fuzzy c-means algorithm. In section 3, the FCMP model for 
fuzzy clustering is introduced as well as a clustering algorithm to fit the model. 
Actually, two versions of the model are described: a generic one, FCMP-0, and 
a relaxed version, FCMP-1. Section 4 discusses the results of a comparative 
experimental study between FCMP models and the fuzzy c-means (FCM) using 
simple data sets from the literature. To study the properties of the FCMP model 
in a systematical way, a data generator has been designed. Section 5 discusses 
the results of an experimental study using generated data. Conclusion on the 
results and future work is in section 6. 

2 Fuzzy c-Means Algorithm 

The fuzzy c-means (FCM) algorithm 0 is one of the most widely used methods in 
fuzzy clustering. It is based on the concept of fuzzy c-partition HU, summarized 
as follows. 

Let $X = \{x_1, \ldots, x_n\}$ be a set of given data, where each data point $x_k$
($k = 1, \ldots, n$) is a vector in $\mathbb{R}^p$, $U_{cn}$ be a set of real $c \times n$ matrices, and $c$ be an
integer, $2 \le c < n$. Then, the fuzzy c-partition space for $X$ is the set

$$M_{fcn} = \Big\{ U \in U_{cn} : u_{ik} \in [0,1],\ \sum_{i=1}^{c} u_{ik} = 1,\ 0 < \sum_{k=1}^{n} u_{ik} < n \Big\}, \quad (1)$$
where $u_{ik}$ is the membership value of $x_k$ in cluster $i$ ($i = 1, \ldots, c$). The value of
$c$ is assumed to be known.

The aim of the FCM algorithm is to find an optimal fuzzy c-partition and
corresponding prototypes minimizing the objective function

$$J_m(U, V; X) = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^m \, \|x_k - v_i\|^2. \quad (2)$$

In (2), $V = (v_1, v_2, \ldots, v_c)$ is a matrix of unknown cluster centers (prototypes)
$v_i \in \mathbb{R}^p$, $\|\cdot\|$ is the Euclidean norm, and the weighting exponent $m$ in $[1, \infty)$ is
a constant that influences the membership values.

The FCM clustering criterion belongs to the class of least squares clustering 
criteria |^. 

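To minimize criterion $J_m$, under the fuzzy constraints defined in (1), the
FCM algorithm is defined as an alternating minimization algorithm (cf. |S| for the
derivations), as follows. Choose a value for $c$, $m$ and $\varepsilon$, a small positive constant;
then, generate randomly a fuzzy c-partition $U^{(0)}$ and set iteration number $t = 0$.
A two-step iterative process works as follows. Given the membership values $u_{ik}^{(t)}$,
the cluster centers $v_i^{(t)}$ ($i = 1, \ldots, c$) are calculated by

$$v_i^{(t)} = \frac{\sum_{k=1}^{n} \big(u_{ik}^{(t)}\big)^m x_k}{\sum_{k=1}^{n} \big(u_{ik}^{(t)}\big)^m}. \quad (3)$$

Given the new cluster centers $v_i^{(t)}$, update membership values $u_{ik}^{(t+1)}$:

$$u_{ik}^{(t+1)} = \left[ \sum_{j=1}^{c} \left( \frac{\|x_k - v_i^{(t)}\|}{\|x_k - v_j^{(t)}\|} \right)^{\frac{2}{m-1}} \right]^{-1}. \quad (4)$$

The process stops when $\|U^{(t+1)} - U^{(t)}\| \le \varepsilon$, or a predefined number of
iterations is reached.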
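
The alternating scheme can be sketched in plain Python as follows. This is a minimal illustration under several assumptions that the text does not fix: the change between two membership matrices is measured by the largest absolute entry difference, a data point that coincides with one or more centers shares its membership equally among them, and $m > 1$ is required so that the exponent in the membership update is defined. The function name `fcm` and its default parameters are hypothetical.

```python
import math
import random

def fcm(X, c, m=2.0, eps=1e-4, max_iter=100, seed=0):
    """Fuzzy c-means on a list of points; returns (memberships U, centers V)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    # Random fuzzy c-partition: every column of U sums to one.
    U = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(c)]
    for k in range(n):
        col = sum(U[i][k] for i in range(c))
        for i in range(c):
            U[i][k] /= col
    V = []
    for _ in range(max_iter):
        # Center step: weighted means with weights u_ik ** m.
        V = []
        for i in range(c):
            w = [U[i][k] ** m for k in range(n)]
            total = sum(w)
            V.append([sum(w[k] * X[k][h] for k in range(n)) / total
                      for h in range(p)])
        # Membership step: inverse of summed distance ratios.
        U_new = [[0.0] * n for _ in range(c)]
        for k in range(n):
            d = [math.dist(X[k], V[i]) for i in range(c)]
            zero = [i for i in range(c) if d[i] == 0.0]
            for i in range(c):
                if zero:
                    U_new[i][k] = 1.0 / len(zero) if i in zero else 0.0
                else:
                    U_new[i][k] = 1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0))
                                            for j in range(c))
        change = max(abs(U_new[i][k] - U[i][k])
                     for i in range(c) for k in range(n))
        U = U_new
        if change <= eps:
            break
    return U, V

X = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
U, V = fcm(X, c=2)
for row in U:
    print([round(u, 3) for u in row])
```

The returned centers are those of the last center step, i.e., they were computed from the memberships of the preceding iteration.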



3 A Multiple Prototype Fuzzy Clustering Model 

3.1 The Generic Model 

Let the data matrix $X$ be preprocessed into $Y$ by shifting the origin to the
data gravity center and scaling features by their ranges. Thus, $Y = [y_{kh}]$ is an
$n \times p$ entity-to-feature data table where each entity, described by $p$ features,
is defined by the row-vector $y_k = [y_{kh}] \in \mathbb{R}^p$ ($k = 1, \ldots, n$; $h = 1, \ldots, p$). This
data set can be structured according to a fuzzy c-partition which is a set of $c$
clusters, any cluster $i$ ($i = 1, \ldots, c$) being defined by: 1) its prototype, a
row-vector $v_i = [v_{ih}] \in \mathbb{R}^p$, and 2) its membership values $\{u_{ik}\}$ ($k = 1, \ldots, n$), so that
the following constraints hold:

$$0 \le u_{ik} \le 1, \quad \text{for all } i = 1, \ldots, c,\ k = 1, \ldots, n; \quad (5)$$

$$\sum_{i=1}^{c} u_{ik} = 1, \quad \text{for all } k = 1, \ldots, n. \quad (6)$$

Notice that in this definition of fuzzy c-partition, the condition $0 < \sum_{k=1}^{n} u_{ik} < n$
in the original definition (1) is relaxed.

Let us assume that each entity $y_k = [y_{kh}]$ of Y is related to each prototype
$v_i = [v_{ih}]$ ($i = 1, \ldots, c$) up to its membership degree $u_{ik}$; that is, $u_{ik}$ expresses that
part of $v_i$ which is present in $y_k$, in such a way that approximately $y_{kh} = u_{ik} v_{ih}$
for every feature $h$. More formally, we suppose that

$$y_{kh} = u_{ik} v_{ih} + \varepsilon_{ikh}, \qquad (7)$$

where the residual values $\varepsilon_{ikh}$ are as small as possible.

The ideal type fuzzy clustering model from [?] assumes that

$$y_{kh} = \sum_{i=1}^{c} u_{ik} v_{ih} + \varepsilon_{kh}, \qquad (8)$$

which implies that all entity points $y_k$ are convex combinations of the prototypes
(up to the residuals). Thus, the prototypes according to this latter model must
lie outside the area of entity points. This may be considered a formalization of
the concept of ideal type in logics, which falls beyond the scope of the current paper
and will be omitted from consideration.

A clustering criterion according to (7) can be defined as fitting of each data
point to each of the prototypes up to the degree of membership. This goal is
achieved by minimizing all the residual values via the least-squares criterion

$$E_0(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} \sum_{h=1}^{p} (y_{kh} - u_{ik} v_{ih})^2 \qquad (9)$$

with regard to the constraints (5) and (6).

The equations in (7), along with the least-squares criterion (9) to be mini-
mized by unknown parameters $U$ and $V = (v_1, v_2, \ldots, v_c) \in \mathbb{R}^{cp}$ for Y given, will
be referred to as the generic fuzzy clustering multiple prototypes model, FCMP-0,
for short. In this model, the principle of the least-squares criterion in the fuzzy c-
means is extended to a data-to-cluster model framework, which inevitably leads
to a more complex form of the criterion.



3.2 Relaxing Restrictions of FCMP-0 

In real domains of application (clinical findings for typical scenarios of diseases, 
personality traits in psychology, types of consumer in market research, and so 
on), the concept of prototype is meaningful in such a way that data entities can 
be described as sharing parts of prototypes. This is the idea underlying FCMP. 
However, the requirement of FCMP-0 that each entity be expressed as a part
of each prototype is obviously too strong and unrealistic. Intuition leads us
to consider that only meaningful sharings, those expressed by high membership
values, should be taken into account in the equations (7).

There are two ways to implement this idea in the FCMP framework: in a
hard manner and in a smooth one. A “hard” version should deal only with
those equations in (7) that involve rather large values of $u_{ik}$. By specifying a
threshold $\beta$ between 0 and 1, only those differences $\varepsilon_{ikh}$ are left in the criterion
(9) that satisfy the inequality $u_{ik} > \beta$. In such a model, FCMP-$\beta$, entities
may relate to as few prototypes as we wish. In particular, $\beta = 0.5$ leads to an
exclusive relationship of any entity to one prototype only. The idea of removing
all small interactions between prototypes and entities from the criterion has been
proposed in the context of fuzzy c-means clustering by Selim and Ismail in
several versions, one of which relates to directly thresholding the membership
weights as suggested in the paragraph above. The authors of [?] referred to
their approaches as “soft clustering”, a kind of intermediate between crisp
clustering and fuzzy clustering.

In this paper, we consider a different, smooth, manner of dealing with the
unrealistic feature of FCMP-0. To smooth the members $\varepsilon_{ikh}$ corresponding to
small memberships, let us weight the squared residuals in (9) by the corresponding
$u_{ik}$:

$$E_1(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} \sum_{h=1}^{p} u_{ik} \, (y_{kh} - u_{ik} v_{ih})^2, \qquad (10)$$

subject to the fuzziness constraints (5) and (6).

The model with this criterion will be denoted as FCMP-1.
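To make the difference between the two criteria concrete, the following short Python sketch evaluates the FCMP-0 criterion (9) and the FCMP-1 criterion (10) for given Y, U and V. The function name `fcmp_criteria` and the list-of-lists data layout (U stored as c rows of n memberships) are assumptions of this illustration, not part of the original text.

```python
def fcmp_criteria(Y, U, V):
    """Return (E0, E1): criteria (9) and (10) for data Y, memberships U, prototypes V.

    Y: n entities, each a list of p features.
    U: c rows, each holding the n memberships of one cluster.
    V: c prototypes, each a list of p features.
    """
    e0 = 0.0
    e1 = 0.0
    for i, v in enumerate(V):
        for k, y in enumerate(Y):
            u = U[i][k]
            # Squared residual of entity k against the u-th "part" of prototype i.
            sq = sum((y[h] - u * v[h]) ** 2 for h in range(len(y)))
            e0 += sq        # FCMP-0: every residual counts equally
            e1 += u * sq    # FCMP-1: residuals weighted by the membership
    return e0, e1
```

In FCMP-1 a residual attached to a small membership contributes little, which is the smoothing effect described above.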



3.3 Minimizing FCMP Criteria 

An alternating minimization algorithm FCM for fuzzy c-means clustering can
be extended for minimization of both the FCMP-0 and FCMP-1 criteria subject
to the fuzzy constraints (5) and (6).

Each iteration of the algorithm consists of two steps as follows. First, given
membership matrix $U^{(t)}$, the optimal prototypes are determined according to the
first-degree optimum conditions as

$$v_{ih}^{(t)} = \frac{\sum_{k=1}^{n} (u_{ik}^{(t)})^{a} \, y_{kh}}{\sum_{k=1}^{n} (u_{ik}^{(t)})^{b}}. \qquad (11)$$

The parameters a, b are a = 1 and b = 2 for FCMP-0, and a = 2 and b = 3 for
FCMP-1.

Formula (11) is similar to expression (3) in FCM for the prototypes, which
shows that the prototypes in FCMP are indeed centroids rather than extremes.
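The prototype step can be sketched in a few lines of Python. The function name `update_prototypes` and the data layout are hypothetical; the sketch assumes that every cluster has at least one nonzero membership, so that the denominator does not vanish.

```python
def update_prototypes(Y, U, a, b):
    """Prototype update of formula (11).

    Use a=1, b=2 for FCMP-0 and a=2, b=3 for FCMP-1.
    Y: n entities with p features; U: c rows of n memberships.
    """
    n, p = len(Y), len(Y[0])
    V = []
    for u_i in U:
        denom = sum(u ** b for u in u_i)  # assumed to be nonzero
        V.append([sum((u_i[k] ** a) * Y[k][h] for k in range(n)) / denom
                  for h in range(p)])
    return V
```

For example, with one feature, `Y = [[2.0], [4.0]]` and a single cluster with memberships `[1.0, 0.5]`, the FCMP-0 prototype is (1·2 + 0.5·4) / (1 + 0.25) = 3.2.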

Second, given prototype matrix V, the optimal membership values are found
by minimizing criterion (9) or (10), respectively. In contrast to FCM, minimiza-
tion of the criteria subject to constraints (5) and (6) is not an obvious task; it
requires an iterative solution on its own.

Upon preliminary experimentation with several options, the gradient projec-
tion method [?] has been selected for finding optimal membership values
(given the prototypes). It can be proven that the method converges fast for
FCMP-0 with a constant (anti) gradient stepsize.

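Let us denote the set of membership vectors satisfying conditions (5) and
(6) by $Q$. The calculations of the membership vectors $u_k^{(t)} = [u_{ik}^{(t)}]$ are based on
vectors $d_k^{(t)} = [d_{ik}^{(t)}]$:

$$d_{ik}^{(t)} = u_{ik}^{(t-1)} - 2\alpha \left( \langle v_i, v_i \rangle \, u_{ik}^{(t-1)} - \langle y_k, v_i \rangle \right), \qquad (12)$$

where $\alpha$ is a stepsize parameter of the gradient method. Then, $u_k^{(t)}$ is to be taken
as the projection of $d_k^{(t)}$ in $Q$, denoted by $P_Q(d_k^{(t)})$ ¹. The process stops when the
condition $\|U^{(t)} - U^{(t-1)}\| < \varepsilon$ is fulfilled.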

Finding membership vectors by minimizing FCMP-1 is performed similarly. 

Thus, the algorithm consists of “major” iterations of updating matrices U and 
V and “minor” iterations of recalculation of membership values in the gradient 
projection method within each of the “major” iterations. 
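The "minor" iterations for FCMP-0 can be sketched as follows. Since the authors' own projection algorithm is not described, this sketch substitutes the standard sort-based Euclidean projection onto the simplex; that choice, the function names, the default stepsize 1 / (2 max ⟨v_i, v_i⟩) and the largest-absolute-change stopping rule are all assumptions of the illustration.

```python
def project_to_simplex(d):
    """Euclidean projection of d onto {u : u_i >= 0, sum(u) = 1}.

    Standard sort-based method, used here in place of the authors'
    unpublished projection algorithm.
    """
    s = sorted(d, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, value in enumerate(s, start=1):
        cumulative += value
        t = (cumulative - 1.0) / j
        if value - t > 0:
            theta = t  # the last j that satisfies the test fixes the shift
    return [max(x - theta, 0.0) for x in d]


def update_memberships_fcmp0(y, V, u, alpha=None, eps=1e-8, max_iter=1000):
    """Gradient projection for one entity y, given prototypes V and start u."""
    vv = [sum(x * x for x in v) for v in V]               # <v_i, v_i>
    yv = [sum(a * b for a, b in zip(y, v)) for v in V]    # <y_k, v_i>
    if alpha is None:
        # Assumed constant stepsize: inverse of the largest curvature.
        alpha = 1.0 / (2.0 * max(vv)) if max(vv) > 0 else 1.0
    for _ in range(max_iter):
        # Anti-gradient step of the FCMP-0 criterion with respect to u.
        d = [u[i] - 2.0 * alpha * (vv[i] * u[i] - yv[i]) for i in range(len(V))]
        u_new = project_to_simplex(d)
        change = max(abs(a - b) for a, b in zip(u_new, u))
        u = u_new
        if change <= eps:
            break
    return u
```

A full "major" iteration would call `update_memberships_fcmp0` once per entity and then recompute the prototypes.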

The algorithm starts with a set of $c$ arbitrarily selected prototype points
in $\mathbb{R}^p$ and an initial membership matrix $U^{(0)}$; it stops when the difference
between successive prototype matrices becomes small.

The algorithm converges only locally (for FCMP-1). Moreover, with a “wrong” 
number of clusters prespecified, FCMP-0 may not converge at all since FCMP-0 
may shift some prototypes to infinity (see discussion in the next subsection) . In 
our experiments, the number of major iterations in FCMP algorithms when they 
converge is small, which is exploited as a stopping condition: when the number of 
major iterations in an FCMP run goes over a large number (in our calculations, 
over 100), that means the process does not converge. 



3.4 FCMP-0 as an Index of Data Structure 

A feature of the FCMP-0 clustering criterion (9) is that it does not change if
vectors $v_i$ and $u_i$ are changed for $\alpha v_i$ and $u_i / \alpha$ for some $i$, where $\alpha$ is an arbitrary
nonzero real. In particular, tending $\alpha$ to infinity, the prototype $v_i$ tends to infinity, too,
while its membership vector, $u_i$, tends to zero, without any change in the corresponding
differences $\varepsilon$ in criterion (9).

This way, some membership values needed to decrease some of the differences
in (9) can be increased by simultaneously decreasing some other ones along with
removing corresponding prototypes.
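The invariance can be checked in one line. Writing $\alpha$ for the rescaling factor (any nonzero real), each residual of the FCMP-0 criterion is unchanged when the prototype is stretched and the memberships are shrunk by the same factor:

```latex
\left( y_{kh} - \frac{u_{ik}}{\alpha} \, (\alpha v_{ih}) \right)^2
  = \left( y_{kh} - u_{ik} v_{ih} \right)^2
  \quad \text{for all } k, h .
```

Only the product $u_{ik} v_{ih}$ enters the criterion, so the criterion by itself cannot distinguish a distant prototype with tiny memberships from a nearby one with large memberships.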

This is, in brief, an explanation of the empirically observed phenomenon
of removing some initially set prototypes from the data set zone by the algo-
rithm FCMP-0. In such a non-convergence case, the number of prototypes should
be decreased until FCMP-0 converges. The final number of prototypes can be
considered a “natural” one relevant to the structure of the data set under con-
sideration. Some results of experimental testing of this feature of FCMP-0 are
presented in two subsequent sections.

¹ The projection $P_Q(d_k^{(t)})$ is based on an algorithm we developed for projecting a
vector over the simplex of membership vectors; its description is omitted
here.



4 Experimental Study I 



The main goal of all experiments is to compare FCMP-0, FCMP-1 and FCM
(with its parameter m = 2). The emphasis is on the
clustering results rather than the performance of the algorithms. It is quite
obvious that our criteria, (9) and, especially, (10), are more complex than that
of FCM, (2), and thus require more calculations.

The results will be discussed in two sequential sections: (a) illustrative re-
sults with some simple data sets taken from the literature: butterfly [?], MS [?],
wine and Iris (this section); (b) general results found with a data generator
designed for this study (next section).

In our experiments, each of the algorithms, FCM, FCMP-0, and FCMP-1,
has been run on the same data set (with the same initial setting) for different
values of c (c = 2, 3, 4, ...).

The clustering solutions found by FCMP-0 and FCMP-1 have been char-
acterised by the following three features: 1) number of clusters found, c'; 2)
separability; and 3) proximity to the prototypes found by FCM. The separability
index was also calculated for FCM solutions.

The separability index, $B_c = 1 - \frac{c}{c-1} \left( 1 - \frac{1}{n} \sum_{i,k} u_{ik}^2 \right)$, assesses the fuzzi-
ness of partition $U$; it takes values in the range [0, 1] such that $B_c = 1$ for hard
partitions and $B_c = 0$ for the uniform memberships (cf. [3, pp. 157, for a detailed
description).
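As an illustration, the separability index can be computed as below. The sketch assumes that the index is the normalised partition coefficient, $B_c = 1 - \frac{c}{c-1}(1 - \frac{1}{n}\sum_{i,k} u_{ik}^2)$, which has the two boundary properties stated in the text; the function name `separability` is hypothetical.

```python
def separability(U):
    """Separability index of a fuzzy partition U (c rows of n memberships).

    Assumes the normalised partition coefficient form of the index.
    """
    c, n = len(U), len(U[0])
    partition_coefficient = sum(u * u for row in U for u in row) / n
    return 1.0 - (c / (c - 1.0)) * (1.0 - partition_coefficient)
```

A hard partition such as `[[1, 0], [0, 1]]` gives 1, and uniform memberships such as `[[0.5, 0.5], [0.5, 0.5]]` give 0.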

The proximity between FCM and FCMP prototypes is defined as follows:

$$D_{FCM} = \sum_{i,h} \big( v_{ih} - \ldots \qquad (13)$$

where $v_{ih}$ / $v'_{ih}$ denote FCM/FCMP prototype feature values, respectively. Match-
ing between FCMP prototypes and FCM prototypes is determined according
to smallest distances. When the number of prototypes c' found by FCMP-0 is
smaller than c, only c' prototypes participate in (13) in this case.

A summary of the results is presented in Table 1, where the number of (major)
iterations, $t_1$, taken by each algorithm has also been registered. The boundary
value $t_1 = 100$ corresponds to the non-convergence case.

These experiments show that FCMP-0 indeed sometimes converges to a
smaller number of prototypes, c', by moving the other prototypes outside of
the data set zone. For the butterfly, wine and MS data sets the numbers of
prototypes found by FCMP-0 correspond to those in the original data (2, 3, and
3, respectively). For the Iris data set, FCMP-0 converges only when c' = 2, even
though the original data set contains three classes (i.e. c = 3). This goes in line
with the claim made by some authors that, actually, the underlying structure in
the Iris data set consists of two clusters only.

Table 1. Results of running FCM, FCMP-0 and FCMP-1 algorithms for the but-
terfly, wine, MS and Iris data sets.

                       t_1             c'          B_c            D_FCM (%)
Data set     c    FCM  MP-0  MP-1    MP-0    FCM   MP-0  MP-1    MP-0   MP-1

Butt         2     22    12     9      2     0.47  0.9   0.83     4.1    7.4
(c=2)        3     27   100    17      2     0.44  0.87  0.69    33     42.4
             4     72   100    32      2     0.49  0.84  0.6     15.7   26

wine         2     12    18    16      2     0.25  0.86  0.73    13.8   15.9
(c=3)        3     21    20    13      3     0.26  0.79  0.87     7.1    6.6
             4     66   100    19      3     0.17  0.7   0.89    18.4   36.4

MS           3     11    11     9      3     0.73  0.85  0.83     0.19   0.7
(c=3)        4     23   100     8      3     0.7   0.75  0.82     0.54   2.2

Iris         2     10     7    10      2     0.67  0.94  0.8      0.36   1.1
(c=3)        3     15   100    16      2     0.56  0.9   0.77     8.71  10.4
             4     28   100    18      2     0.48  0.78  0.76     6.2   13

The dissimilarity values, $D_{FCM}$, for FCMP-0 are small in the cases when the
method converges and high in the other cases. This shows that in the case of
convergence, FCMP-0 prototypes are very similar to FCM ones.

The algorithm FCMP-1 always converges to a solution for the various values
of c (i.e. c' = c). However, the pattern of $D_{FCM}$ dissimilarity values for FCMP-1
closely follows that of FCMP-0.

Also, FCMP-0 and FCMP-1 partitions are more contrasting than FCM ones,
according to the separability coefficient, $B_c$. Finally, the numbers of major itera-
tions for both FCMP-0 and FCMP-1 are always less than for FCM. Nevertheless,
the former algorithms have running times greater than FCM because of
the time spent on “minor” iterations.

5 Experimental Study II 

The main goal of this series of experiments is twofold. First, to perform a more 
extensive comparison of FCMP-0, FCMP-1 and FCM methods by using gener- 
ated data sets. Second, to study the behavior of FCMP-0 as an index of the 
number of prototypes in data. 

In order to study characteristics of the FCMP model in identifying a cluster
structure, the model should be tested on data exhibiting its own cluster structure
(a cluster tendency [2]). To accomplish this, a random data generator has been
constructed as follows.
Data Generator

1. The dimension of the space ($p$), the number of clusters ($c$) and numbers
$n_1, n_2, \ldots, n_c$ are randomly generated within prespecified intervals. The data
set cardinality is defined as $n = \sum_{i=1}^{c} n_i$.

2. $c$ cluster directions are defined as follows: vectors $o_i \in \mathbb{R}^p$ ($i = 1, \ldots, c$) are
randomly generated within a prespecified cube; then, their gravity center $o$
is calculated.

3. For each $i$, define two p-dimensional sampling boxes, one within bounds
$A_i = [0.9\,o_i, 1.1\,o_i]$ and the other within $B_i = [o, o_i]$; then generate randomly
$0.1\,n_i$ points in $A_i$ and $0.9\,n_i$ points in $B_i$.

4. The data generated are normalized by centering to the origin and scaling by
the range.
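The four steps of the generator can be sketched in Python as follows. This is an illustration under stated assumptions rather than the generator used in the study: the function name `generate_data` is hypothetical, the dimension and cluster sizes are passed in instead of being drawn at random, the cube is taken to be $[-1, 1]^p$, and points are drawn uniformly inside each box.

```python
import random


def generate_data(p, sizes, cube=(-1.0, 1.0), seed=0):
    """Generate clustered data along the lines of steps 1-4.

    p: space dimension; sizes: list of cluster cardinalities n_1..n_c.
    """
    rng = random.Random(seed)
    c = len(sizes)
    # Step 2: cluster directions o_i and their gravity center o.
    O = [[rng.uniform(*cube) for _ in range(p)] for _ in range(c)]
    o = [sum(O[i][h] for i in range(c)) / c for h in range(p)]

    def sample_box(a, b):
        # Uniform point in the box spanned by corners a and b (assumed law).
        return [rng.uniform(min(x, y), max(x, y)) for x, y in zip(a, b)]

    data = []
    for i, n_i in enumerate(sizes):
        near = round(0.1 * n_i)
        # Step 3: box A_i around o_i and box B_i between o and o_i.
        low = [0.9 * x for x in O[i]]
        high = [1.1 * x for x in O[i]]
        data += [sample_box(low, high) for _ in range(near)]
        data += [sample_box(o, O[i]) for _ in range(n_i - near)]
    # Step 4: center each feature at the origin and scale by its range.
    n = len(data)
    for h in range(p):
        column = [row[h] for row in data]
        mean = sum(column) / n
        spread = max(column) - min(column)
        for row in data:
            row[h] = (row[h] - mean) / spread if spread > 0 else 0.0
    return data
```

For instance, `generate_data(20, [75, 75, 75])` yields 225 points in 20 dimensions, the sizes quoted for the data set of Fig. 1.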

To visualize data, they are projected into a 2D/3D space of the best principal
components, as can be seen in Figure 1.

Fig. 1. A 3D plot of the best three principal components of generated data
(n = 225, p = 20, c = 3). The resulting prototypes for FCMP-0, FCMP-1 and
FCM with c' = 3 are also projected in the same principal components.



Some 70 data sets have been generated with distinct numbers of prototypes
(i.e. c = 3, 4, 5, 6, ...) and different space dimensions (p = 20, 30, 50, ..., 150).

For each group of data sets of the same dimension (p) and generated proto-
types (c), the three algorithms have been compared based on the same parame-
ters used in experiment I: number of major iterations ($t_1$), number of prototypes
found (c'), separability coefficient $B_c$ and distances ($D_{FCM}$).

In general, the results follow those found for the illustrative data sets. 
1. The dissimilarity values, $D_{FCM}$, for FCMP-0 prototypes are close to those
of FCM when FCMP-0 converges. The distances are large when FCMP-0
does not converge. The pattern of similarities for FCMP-1 follows that for
FCMP-0.

2. On average, the number of major iterations ($t_1$) in FCMP-1 is smaller than
that in FCM, while in FCMP-0 this number does not differ significantly from
that in FCM. However, the running time is greater for both FCMP algo-
rithms, because of the minor iterations with the gradient projection method.

3. The separability, $B_c$, of FCMP-0 and FCMP-1 solutions is always higher
than that of FCM.

4. There is a percentage of cases when the number of clusters found by FCMP-0
is smaller than the number of generated prototypes. It was also found
that FCM may drastically reduce the number of prototypes (by making
some of the prototypes equal to each other), especially when the dimension of
the space is high (p > 100). Moreover, in some cases, FCM leads to an even
smaller number of prototypes than FCMP-0.

These conclusions point to features of the models under investigation and 
should not be interpreted as advantages of one over others. 

6 Conclusion 

The fuzzy clustering approach proposed in this paper suggests a model of how 
the data is generated from a cluster structure to be identified. This implies 
direct interpretability of the fuzzy membership values, which alone should be 
considered a motivation for introducing the model-based methods. 

Based on the experimental results obtained in this research, the FCMP-1 
seems a model-based clustering approach that parallels FCM, and FCMP-0 can 
be viewed as a device for estimating the number of clusters in the underlying 
structure to be found. 

This model-based clustering approach seems appealing in the sense that, 
on doing cluster analysis, the experts of a knowledge domain usually have a 
conceptual understanding of how the domain is organized in terms of prototypes. 
This knowledge, put into the format of tentative prototypes, may well serve as 
the initial setting for data-based structurization of the domain. In such a case, the
belongingness of data entities to clusters is based on how much they share the
features of corresponding prototypes. This seems fair in such application areas as
mental disorders in psychiatry or consumer behavior in marketing. However, the 
effective utility of the multiple prototypes model still remains to be demonstrated 
with real data. 

Abstract. In this paper we report the results of a comparative study
of different variations of genetic programming applied to binary data
classification problems. The first genetic programming variant weights
data records when calculating the classification error and modifies
the weights during the run. Hereby the algorithm defines its own
fitness function in an on-line fashion, giving higher weights to ‘hard’
records. Another novel feature we study is the atomic representation,
where ‘Booleanization’ of data is not performed at the root, but at the
leaves of the trees, and only Boolean functions are used in the trees’ body.
As a third aspect we look at generational and steady-state models in
combination with both features.



1 Introduction 

Binary data classification problems (with exactly two disjoint classes) form an
important application area of machine learning techniques, in particular genetic
programming (GP). In this paper we compare a number of different variants
for a GP applied to such problems. Rather than simply tuning traditional GP
parameters, we investigate the effect of two significant changes in a fixed GP
setup (closely matching the setups in [3]) in combination with a generational
and a steady-state model, respectively.

The first modification we consider amounts to using a fitness function based
on weighting data records when calculating the total classification error and
modifying the weights (thus the fitness function) on-line, during the run. This
feature, called Stepwise Adaptation of Weights (SAW), was first used in
penalty functions for constraint satisfaction problems (see e.g. [3]) and applied
in one particular setup within a GP [2]. Here we conduct a systematic study of
combinations of SAW-ing with a new representation in GP.

A specific representation forms the second line of investigation. In standard
GP a function set of real valued operators is used and a special operator in the
root of the tree ‘Booleanizes’ the outcome, resulting in a (binary) classification of
a given data record [?]. In our atomic representation ‘Booleanization’ of data is
not performed at the root, but immediately at the leaves of a tree, and only Boolean
functions are used in the tree’s body. With such a relatively simple function set
the flexibility to create different models, i.e. trees, is lower than in the standard
representation, but the models become more transparent, i.e. more readable for
humans. In practical applications this property is often more important than
a lower classification error. The first experiments with this representation are
reported in [?].

The third aspect we consider concerns steady-state and generational GP mod-
els. The standard GP approach is based on the generational population model
and although some authors do use a steady-state model [?], to our knowledge
there is no experimental comparison available on the relative advantage of either
model.

All the experiments were done using the Library for Evolutionary Algorithm
Programming (LEAP)¹ system for the construction of the GPs [?]. This library is
currently under development using C++ and Design Patterns [?], and is aiming
to be a framework that makes it easy to test out different methods within the
field of evolutionary computation. It also provides an easy way for users to
incorporate their own techniques, representations and problems, thus assuring a
fast and easy way of dealing with testing evolutionary algorithms in general.

2 Data Sets and Experiment Setup 

For comparing different algorithm variants we use four different data sets from
the Statlog collection²: the Australian Credit, the German Credit, the Heart
Disease, and the Pima Indians Diabetes data set [?]. Each algorithm is eval-
uated using n-fold cross validation and the performance measure for different
algorithms is the average classification error over the n folds. Statlog uses a cost
matrix for the reported results on the Heart Disease and German Credit data
sets. This influences the measured classification error as different penalties are
given for each value of an attribute. For instance, misclassifying a patient with
a heart disease receives a higher penalty than misclassifying a patient without a
heart disease. We do not consider the Statlog cost matrix. In Table 1 we give the
number of records for each data set and the number of folds done in the cross
validation tests.

Besides the two tested features, the other components and parameters of our
GP are kept close to the experiments as described in [?], although we do use
mutation in all cases (depending on the representation). The parameters that
are common to all tested algorithm variants are summarized in Table 2.

3 SAW-ing 

The rationale behind the Stepwise Adaptation of Weights (SAW) mechanism
is to let the algorithm define the weights itself. The application domain of this
mechanism is not restricted to GP and to data analysis; it can be used for any EA
where the fitness function is composed in a manner represented by Equation 1. In
fact, SAW-ing has been introduced and first applied in the context of constraint
satisfaction [?].

¹ Available on WWW at http://www.wi.leidenuniv.nl/~jvhemert/leap
² Available on WWW at http://www.ncc.up.pt/liacc/ML/statlog

Table 1. Test data sets

data set                  number of records   cross validation
Australian Credit          690                10-fold
German Credit             1000                10-fold
Heart Disease              270                 9-fold
Pima Indians Diabetes      768                12-fold

Table 2. Main GP parameters shared by all algorithm variants

Parameter                                  Value
Initial max. tree depth                    5
Max. number of nodes                       200
Initialization                             ramped half-and-half
Population size                            1000
Parent selection                           linear ranking
Bias for linear ranked selection [?]       1.5
Replacement strategy                       replace worst in population
Stop condition                             perfect classification or
                                           40000 evaluations
Mutation (atom/subtree) probability        0.1
Xover probability                          0.9
Xover type                                 swap subtrees
Xover functions:atoms
or functions:terminals ratio               4:1

In a SAW-ing evolutionary algorithm the weights are initially all set at the
same value and these weights are repeatedly increased with a certain step size
Δw at predefined moments during the run. The general mechanism is presented
in Figure 1.

On-line weight update mechanism

set initial weights (thus fitness function f)
while not EA is finished do
    for the next Tp fitness evaluations do
        let the EA run with this f
    end for
    redefine weights in f and recalculate fitness of individuals
end while

Fig. 1. Stepwise adaptation of weights (SAW)

Note that in a classification problem the overall quality of a candidate so-
lution (the accuracy of a model on the whole data set) is determined by local
scores on data records. In other words, the evaluation function (fitness function)
measuring the total classification error is composed of the errors on particular
data records. An evolutionary algorithm searching for a good model classifying
the given records in a data set D could use the fitness function (to be minimized)
defined as follows.

$$f(x) = \sum_{r \in D} w_r \cdot error(x, r) \qquad (1)$$

where $w_r$ is a weight assigned to the record $r$ and $error(x, r)$ is a measure of
misclassification. In the most simple case

$$error(x, r) = \begin{cases} 1 & \text{if } x \text{ classifies } r \text{ incorrectly} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

It is clear that a GP will primarily ‘concentrate’ on classifying those records 
correctly that carry the highest weights. Therefore, the weights need to be deter- 
mined in accordance with the hardness of the records. Nevertheless, to determine 
how weights should be assigned to records appropriately, may require substan- 
tial insight into the problem, which may not be available, or only at substantial 
costs. 

In the present SAW-ing GP implementation the weights are initially set as
$w_r = 1$. Redefining the fitness function happens by adding Δw to the weights
of those records that are misclassified by the best individual at the end of each
period of Tp fitness evaluations. Obviously, the general rationale behind SAW-ing
applies to our situation: records that are hard to classify correctly should have
a higher weight, and by redefining f on-line the algorithm adjusts the weights
according to its own experience.
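The weighted fitness and the weight update described above can be sketched in Python as follows. The function names, the representation of a record as a pair (attributes, class label) and of a candidate solution as a callable are assumptions made for this illustration; they are not taken from the LEAP library.

```python
def weighted_fitness(classifier, records, weights):
    """Weighted classification error: sum of weights of misclassified records."""
    return sum(w for (x, label), w in zip(records, weights)
               if classifier(x) != label)


def saw_update(best, records, weights, delta_w=1):
    """Add delta_w to the weight of every record the best individual gets wrong."""
    return [w + delta_w if best(x) != label else w
            for (x, label), w in zip(records, weights)]
```

In a full run, `saw_update` would be called once every Tp fitness evaluations, after which the fitness of all individuals is recalculated with the new weights.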

The parameters of the SAW-ing mechanism as used throughout the present study
are displayed in Table 3.

Table 3. SAW-ing parameters

Parameter               Value
Initial weights w_r     1
Δw                      1
Tp steady-state         200 evaluations
Tp generational         1000 evaluations



As stated earlier, SAW-ing has been tested on various problems in the field
of constraint satisfaction. The results obtained in these experiments are very
promising. We therefore have extended the study of the SAW-ing mechanism
to data classification using genetic programming. In Figure 2 we show typical
runs of the SAW-ing mechanism on different binary CSPs. Figures from other
problems share a common feature: at first the fitness of the best individual is
rising and then it suddenly drops to either a local optimum or eventually a global
optimum. Also, every fitness curve of a SAW-ing evolutionary algorithm shows a
saw-shaped graph, due to the redefinition of the weights every Tp fitness evaluations.





Fig. 2. Fitness curves for the SAW-ing EA on binary CSP. Figure on the left is a 
zoom in of the figure on the right for the first 10000 evaluations. 



4 Atomic Representation 

The motivation behind our so-called atomic representation comes from practical
applications, where the selection of the best model to classify data does not
depend on classification accuracy alone. In practice, before adopting a certain model
an intuitive verification by the user's common sense takes place. To this end, it is
crucial that models (trees in a GP based problem solver) are transparent, that is,
easy to read and understand for humans. Often, the size of the trees is limited to
achieve a satisfactory level of transparency. Here we investigate another option.
Rather than using a function set of numerical operators and a special operator
in the root of the tree that ‘Booleanizes’ the outcome (resulting in a binary
classification of a given data record), we process numerical information at the
leaves of a tree, transform it into Boolean statements, and apply only Boolean
functions in the body of the tree (see Figure 3).

Formally, an atom is syntactically a predicate of the form operator(var, const),
built up from a variable indicating a field in the data set, a constant between 0
and 1, and a comparing operator, denoted by A< and A>. In this representation
the conditional part of a classification rule could look like:

(A>(r1, 0.3) nor A<(r0, 0.6)) or A>(r1, 0.2)
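To show how such a rule is evaluated on a data record, here is a small Python sketch. The helper names are hypothetical, and reading the two variables in the example rule as fields 0 and 1 of the record is an assumption, since the subscripts are hard to make out in the source.

```python
def greater(index, const):
    """Atom A>(var, const): true if the record's field exceeds the constant."""
    return lambda record: record[index] > const


def less(index, const):
    """Atom A<(var, const): true if the record's field is below the constant."""
    return lambda record: record[index] < const


def nor(a, b):
    return lambda record: not (a(record) or b(record))


def or_(a, b):
    return lambda record: a(record) or b(record)


# The example rule, with variables read as fields 1 and 0 (an assumption).
rule = or_(nor(greater(1, 0.3), less(0, 0.6)), greater(1, 0.2))
```

Applied to the record `[0.7, 0.1]` both inner atoms are false, so the `nor` node and hence the whole rule evaluate to true; for `[0.5, 0.1]` the rule is false.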

Fig. 3. Representation of a classification rule as a tree.

In this representation we use subatomic mutation. Every time an individual is
selected for a mutation, we first choose a node in the tree to work on. If this node
is part of the function set, a subtree mutation will be performed. If this node is a
leaf (an atom), we choose with equal chance whether this will be a subtree mutation or
a subatomic mutation. A subatomic mutation works by first selecting, with equal
chance, whether the operation will be performed on the variable or on the constant.
In case of a variable we randomly select a new variable. In case of the constant
c, a small number Δc (−d < Δc < d) is generated, which is then added to the
constant as shown in Equation 3. The values for all records are between 0 and 1
in the data sets we consider.



$$c' = \begin{cases} 0, & \text{if } c + \Delta c < 0, \\ 1, & \text{if } c + \Delta c > 1, \\ c + \Delta c, & \text{otherwise.} \end{cases} \qquad (3)$$

With this representation we use the parameter setting displayed in Table 4,
and with the standard GP representation we use the parameter setting displayed
in Table 5.

Table 4. Parameters of the GP with an atomic representation.

Parameter                Value
Function set             {and, or, nand, xnor}
Atom set                 attribute greater or less than a constant
Mutation type            1. subtree replacement
                         2. subatomic mutation
Subatomic parameter d    0.1
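The subatomic mutation with the clipping rule of Equation 3 can be sketched as follows. The tuple representation of an atom, the function name and the use of a uniform distribution for Δc are assumptions of this illustration; `rng.uniform` may in principle return the interval end points, which differs negligibly from the strict bounds −d < Δc < d.

```python
import random


def subatomic_mutation(atom, n_vars, d=0.1, rng=random):
    """Mutate an atom (operator, variable index, constant).

    With equal chance either a new variable is drawn, or the constant is
    shifted by a small delta and clipped to [0, 1].
    """
    op, var, const = atom
    if rng.random() < 0.5:
        # Mutate the variable: pick a field of the data set at random.
        var = rng.randrange(n_vars)
    else:
        # Mutate the constant and clip the result to the unit interval.
        delta = rng.uniform(-d, d)
        const = min(1.0, max(0.0, const + delta))
    return (op, var, const)
```

For example, `subatomic_mutation(('>', 1, 0.95), n_vars=14)` returns either the same atom with a different field index or a constant within 0.1 of 0.95, never above 1.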



5 Steady-State vs. Generational Model 

Here we will look at steady-state and generational GP models. The standard
GP approach is based on the generational population model [?] and although
some authors do use a steady-state model [?], to our knowledge there is no
experimental comparison available on the relative advantages of these models.

Table 5. Parameters of the standard GP.

Parameter         Value
Function set      {+, −, ×, %}
Terminal set      {x_1, ..., x_n} ∪ [0, 1]
Mutation type     1. subtree replacement
                  2. point mutation

In the steady-state variant we update the population after creating two off- 
spring, thus after one cycle of parent selection, crossover and mutation, and use 
Tp = 200 (recalculating the weights after 100 cycles). In the generational case, 
however, maintaining the same setup is not possible. Namely, for a generational 
GP the update interval Tp must be a multiple of the population size, otherwise a 
weight revision would be necessary in the middle of creating a new generation. 
This imposes a restriction on the combination of SAW-ing and the generational 
model, leaving less freedom for the user to determine an algorithm setup. In or- 
der to allow the maximal number of weight recalculations, while maintaining the 
population size 1000 we use Tp = 1000 in the experiments with the generational 
model. Notice that this implies a handicap for SAW-ing as the number of weight 
adjustments is reduced by a factor of five. 
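The factor of five follows directly from the two update intervals. Assuming a run uses the full budget of 40000 evaluations given as the stop condition, the maximal numbers of weight recalculations are:

```latex
\frac{40000}{T_p^{\text{steady-state}}} = \frac{40000}{200} = 200,
\qquad
\frac{40000}{T_p^{\text{generational}}} = \frac{40000}{1000} = 40,
\qquad
\frac{200}{40} = 5 .
```

Runs that stop earlier because a perfect classification is found perform fewer updates in both models, but the ratio between the two intervals stays the same.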

6 Results 

We present the results of the experiments arranged around the four data
sets used in this study. For each algorithm variant we give the average classifi-
cation error as performance measure. The best result in each table
is presented in italics.

Looking at Tables 6 through 9 we can make a number of observations. Un-
der the steady-state model the SAW-ing variant has or shares the first place in five
of the eight tests. Looking from a different perspective, the standard GP seems
slightly better than the atomic representation, having a better performance five
times out of eight.

To give some insight into the influence the SAW-ing mechanism has on the
behavior of the fitness function, we give some figures of a typical run of the GP.



Table 6. Average classification error on the Australian Credit data set.

             steady state        generational
algorithm    atom    standard    atom    standard
SAW          0.278   0.241       0.242   0.242
no-SAW       0.246   0.241       0.243   0.232
Table 7. Average classification error on the German Credit data set.

             steady state        generational
algorithm    atom    standard    atom    standard
SAW          0.281   0.303       0.295   0.292
no-SAW       0.281   0.284       0.300   0.281

Table 8. Average classification error on the Heart Disease data set.

             steady state        generational
algorithm    atom    standard    atom    standard
SAW          0.200   0.211       0.222   0.181
no-SAW       0.211   0.200       0.230   0.170



In Figure 4 we show plots of the steady-state GP, with and without the SAW-ing
mechanism. The GP without SAW-ing shows the typical decrease of the fitness.
However, the plot of the GP with SAW-ing shows much fluctuation, in the range
of 200. It is therefore crucial to stop the algorithm at the right time, i.e. when
the fitness value is at its lowest point.








Fig. 4. Typical fitness curve for the steady-state GP without SAW-ing (left) and 
with SAW-ing (right) on the Australian Credit data set (first fold). 



7 Comparison with Other Techniques 

In order to compare our GP to other data mining techniques found in 0, we 
have chosen three algorithms outside the field of evolutionary computation. The 
algorithms we have chosen are LogDisc (a statistical method), C4.5 (a well- 
known decision trees algorithm) and Back-propagation (the standard algorithm 

® Recall that we are trying to minimize the fitness function 
Table 9. Average classification error on the Pima Indian Diabetes data set.

              steady state        generational
algorithm     atom    standard    atom    standard
SAW-ing       0.263   0.257       0.267   0.257
no-SAW-ing    0.283   0.266       0.250   0.253



for training multi-layered feed-forward networks). Since the results reported in
Statlog have used a cost matrix (Heart Disease and German Credit |3I21) which
we have not, we only compare our results on the Australian Credit and Pima
Indians Diabetes data sets. The results can be found in Table 10. As
representative for GP we use the standard (generational) GP, which has the best
overall performance.



Table 10. Results on the Australian Credit and Pima Indians Diabetes Datasets.

data set       Australian Credit    Pima Indians Diabetes
BackProp       0.154                0.248
C4.5           0.155                0.270
standard GP    0.232                0.253
LogDisc        0.141                0.223



As we can clearly see in Table 10, the standard GP performs poorly on the
Australian Credit data set when compared to the other three techniques. The
performance of the standard GP on the Pima Indians data set is not too bad. It
is close to back-propagation and much better than C4.5.

8 Conclusions and Future Research 

The comparison of the atomic and the standard representation indicates a hard
trade-off situation. Giving up the flexibility of numerical operators for the sake
of transparency achieved by using Boolean functions in the bodies of the trees
comes at the cost of performance. It seems that there is no generally advisable
option; the choice between the two representations has to be made on a
case-by-case basis, depending on the priorities in the given problem context.

Looking at the differences between the effects of the SAW-ing mechanism we
should distinguish between the steady-state and the generational models. Using
the steady-state model the SAW-ing variant, which helps the GP to a better
performance three times out of eight, is not able to provide a consistent
improvement. This is in contrast to the results found in constraint satisfaction
PI El and it needs further investigation to pinpoint the exact reason for this
observation. One observation, however, is that the GP version with SAW-ing
shows much fluctuation in the fitness, causing a performance decrease when the
algorithm is stopped at the wrong time, i.e. when the best solution in the
population does not perform well.

These experiments are just the beginning of an extensive study of using 
genetic programming for data mining purposes. Research in the near future will 
focus on the comparison of techniques in this paper and the standard GP using 
commercial data as a benchmark. 
Abstract. The well-known fuzzy c-means algorithm is an objective function
based fuzzy clustering technique that extends the classical k-means
method to fuzzy partitions. By replacing the Euclidean distance in the
objective function other cluster shapes than the simple (hyper-)spheres
of the fuzzy c-means algorithm can be detected, for instance ellipsoids,
lines or shells of circles and ellipses. We propose a modified distance
function that is based on the dot product and allows us to detect a new
kind of cluster shape and also lines and (hyper-)planes.



1 Introduction 

Fuzzy clustering techniques aim at finding a suitable fuzzy partition for a given 
data set. For a fuzzy partition a datum is not necessarily assigned to a unique 
class or cluster, but has membership degrees between zero and one to each 
cluster. Fuzzy clustering algorithms are applied for various reasons: 

— The membership degrees give information about the ambiguity of the clas- 
sification. 

— Fuzzy clustering can adapt to noisy data and classes that are not well sepa- 
rated. 

— Since most fuzzy clustering approaches are based on optimizing an objective 
function, membership degrees represent continuous parameters so that a 
continuous optimization problem has to be solved. 

— Fuzzy clustering can be applied to learning fuzzy rules from data. 

In this paper we briefly review the principal objective function-based fuzzy
clustering approach in section 2. Various modifications of the distance function
in the objective function have been proposed in order to model different cluster
forms. In section 3 we introduce a new angle-based distance measure that
is suitable for data sets with a smaller number of extreme values and a large
number of ‘normal’ values. Section 4 modifies this approach and we obtain a
clustering algorithm to detect lines and (hyper-)planes that can be applied to
line recognition as well as to constructing Takagi-Sugeno fuzzy rule systems (see
for instance m) that describe a function in terms of local linear models.

^ This work was supported by the European Union under grant EFRE 98.053

2 Objective Function-Based Fuzzy Clustering 

We cannot give a complete overview on fuzzy clustering here and mention only 
the basic ideas in order to provide the background for our new algorithms. For 
a thorough overview on fuzzy clustering we refer to |2I9| . Most fuzzy clustering 
algorithms aim at minimizing the objective function of weighted distances of the 
data to the clusters 



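$$\sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} \, d^{2}(v_i, x_k) \tag{1}$$

under the constraints

$$\sum_{k=1}^{n} u_{ik} > 0 \quad \text{for all } i \in \{1, \ldots, c\} \tag{2}$$

and

$$\sum_{i=1}^{c} u_{ik} = 1 \quad \text{for all } k \in \{1, \ldots, n\}. \tag{3}$$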

$X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^p$ is the data set, $c$ is the number of fuzzy clusters,
$u_{ik} \in [0, 1]$ is the membership degree of datum $x_k$ to cluster $i$, $v_i$ is the prototype
or the vector of parameters for cluster $i$, and $d(v_i, x_k)$ is the distance between
prototype $v_i$ and datum $x_k$. The parameter $m > 1$ is called fuzziness index. For
$m \to 1$ the clusters tend to be crisp, i.e. either $u_{ik} \to 1$ or $u_{ik} \to 0$, resulting in
the hard c-means algorithm; for $m \to \infty$ we have $u_{ik} \to 1/c$. Usually $m = 2$ is
chosen. (2) ensures that no cluster is empty, (3) enforces that for each datum
its classification can be distributed over different clusters, but the sum of the
membership degrees to all clusters has to be one for each datum. Therefore, for
this approach the membership degrees can be interpreted as probabilities and the
corresponding clustering approach is called probabilistic. The strict probabilistic
constraint was relaxed by Dave who introduced the concept of noise clustering
prZj . An additional noise cluster is added and all data have a (large) constant
distance to this noise cluster. Therefore, noise data that are far away from all
other clusters are assigned to the noise cluster with a high membership degree.
Krishnapuram and Keller H2] developed possibilistic clustering by completely
neglecting the probabilistic constraint (3) and adding a term to the objective
function that avoids the trivial solution assigning no data to any cluster. We
cannot discuss the details of these approaches here and restrict our considerations
to the probabilistic fuzzy clustering approach. However, our algorithms can be



Fuzzy Clustering Based on Modified Distance Measures 293 



applied in the context of noise and possibilistic clustering in a straightforward
way.

We also do not consider the problem of determining the number of clusters 
in this paper and refer to the overview given in . 

The basic fuzzy clustering algorithm is derived by differentiating the Lagrange
function of (1), taking the constraint (3) into account. This leads to the
necessary condition

$$u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( \frac{d^{2}(v_i, x_k)}{d^{2}(v_j, x_k)} \right)^{\frac{1}{m-1}}} \tag{4}$$

for the membership degrees for a (local) minimum of the objective function, 
given the prototypes are fixed. In the same way, we can derive equations for the 
prototypes, fixing the membership degrees, when we have chosen the parameter 
form of the prototypes and a suitable distance function. 
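
As an illustration of the membership update (4), the following minimal Python
sketch computes the membership degrees of one datum from its squared distances
to all prototypes. The function name is our own, and the sketch assumes that all
distances are strictly positive (a datum that coincides with a prototype would
need separate treatment).

```python
def membership_degrees(squared_distances, m=2.0):
    """Membership degrees of one datum according to equation (4).

    squared_distances[i] is d^2(v_i, x_k) for cluster i and must be > 0;
    m > 1 is the fuzziness index.
    """
    exponent = 1.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_j) ** exponent for d_j in squared_distances)
        for d_i in squared_distances
    ]


# A datum three times as far (in squared distance) from cluster 2 as from cluster 1.
print(membership_degrees([1.0, 3.0]))
```

By construction the returned degrees sum to one, so the probabilistic
constraint (3) is satisfied for each datum.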

The corresponding clustering algorithm is usually based on the so-called al- 
ternating optimization scheme that starts with a random initialization and alter- 
natingly applies the equations for the prototypes and the membership degrees 
until the changes become very small. Convergence to a local minimum or (in 
practical applications very seldom) a saddle point can be guaranteed jili) . 

The simplest fuzzy clustering algorithm is the fuzzy c-means (FCM) (see
e.g. PI) where the distance $d$ is simply the Euclidean distance and the prototypes
are vectors $v_i \in \mathbb{R}^p$. It searches for spherical clusters of approximately the same
size and by differentiating we obtain the necessary conditions

$$v_i = \frac{\sum_{k=1}^{n} u_{ik}^{m} x_k}{\sum_{k=1}^{n} u_{ik}^{m}} \tag{5}$$

for the prototypes that are used alternatingly with (4) in the iteration procedure.
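
The alternating optimization scheme for FCM can be sketched in a few lines of
Python. This is a minimal illustration under our own assumptions: the initial
prototypes are passed in by the caller (the text uses a random initialization),
squared distances are clamped to a small positive value to avoid division by
zero, and the names, tolerance and iteration limit are hypothetical choices.

```python
def squared_euclidean(v, x):
    return sum((a - b) ** 2 for a, b in zip(v, x))


def fuzzy_c_means(data, prototypes, m=2.0, tol=1e-9, max_iter=100):
    """Alternate the membership update (4) and the prototype update (5)."""
    prototypes = [list(v) for v in prototypes]
    memberships = []
    for _ in range(max_iter):
        # Membership update (4): one row of degrees per datum.
        memberships = []
        for x in data:
            d = [max(squared_euclidean(v, x), 1e-12) for v in prototypes]
            memberships.append(
                [1.0 / sum((d_i / d_j) ** (1.0 / (m - 1.0)) for d_j in d)
                 for d_i in d]
            )
        # Prototype update (5): mean of the data weighted by u_ik^m.
        largest_shift = 0.0
        for i in range(len(prototypes)):
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            new_v = [
                sum(w * x[t] for w, x in zip(weights, data)) / total
                for t in range(len(data[0]))
            ]
            largest_shift = max(largest_shift,
                                squared_euclidean(new_v, prototypes[i]))
            prototypes[i] = new_v
        if largest_shift < tol:  # changes have become very small
            break
    return prototypes, memberships


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centres, degrees = fuzzy_c_means(data, [(0.0, 0.0), (10.0, 10.0)])
print(centres)
```

For two well separated groups of points as in this toy data set, the prototypes
end up close to the two group means.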

Gustafson and Kessel |H] designed a fuzzy clustering method that can adapt
to hyper-ellipsoidal forms. The prototypes consist of the cluster centres $v_i$ as in
FCM and (positive definite) covariance matrices $C_i$. The Gustafson and Kessel
algorithm replaces the Euclidean distance by the transformed Euclidean distance

$$d^{2}(v_i, x_k) = (\det C_i)^{1/p} \cdot (x_k - v_i)^{\top} C_i^{-1} (x_k - v_i).$$
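
For the two-dimensional case ($p = 2$) the transformed distance can be written
out explicitly. The following Python sketch is an illustration only; the
function name is our own and the restriction to 2x2 matrices is an assumption
made to keep the inverse and determinant in closed form.

```python
def gk_squared_distance_2d(v, x, cov):
    """Gustafson-Kessel distance (det C)^(1/p) (x - v)^T C^(-1) (x - v), p = 2.

    cov is a positive definite 2x2 matrix given as ((a, b), (c, d)).
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = x[0] - v[0], x[1] - v[1]
    # C^(-1) = 1/det * ((d, -b), (-c, a)); evaluate the quadratic form directly.
    quadratic_form = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return det ** 0.5 * quadratic_form


identity = ((1.0, 0.0), (0.0, 1.0))
print(gk_squared_distance_2d((0.0, 0.0), (3.0, 4.0), identity))  # squared Euclidean
```

With the identity matrix the expression reduces to the squared Euclidean
distance used by FCM.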



Besides spherical or ellipsoidal cluster shapes also other forms can be de- 
tected by choosing a suitable distance function. For instance, the prototypes of 
the fuzzy c-varieties algorithm (FCV) describe linear subspaces, i.e. lines, planes 
and hyperplanes m- The equations for the prototypes of this algorithm re- 
quire the computation of eigenvalues and eigenvectors of (weighted) covariance 
matrices. FCV can be applied to image recognition (line detection) and to con- 
struct local linear (fuzzy) models. Shell clustering algorithms are another class 
of fuzzy clustering techniques that are mostly applied to image recognition and 
detect clusters in the form of boundaries of circles, ellipses, parabolas etc. (For 
an overview on shell clustering see 0III-) 
In principle any kind of prototype parameter set and distance function can
be chosen in order to have flexible cluster shapes. However, the alternating
optimization scheme, which can at least guarantee some weak kind of convergence,
can only be applied when the corresponding distance function is differentiable.
But even for differentiable distance functions we usually obtain equations for the
prototypes that have no analytical solution (for instance |^). This means that
we have to cope with numerical problems and need in each iteration step a
numerical solution of a coupled system of non-linear equations. Other approaches
try to optimize the objective function directly by evolutionary algorithms (for
an overview see USD). Nevertheless, fuzzy clustering approaches with distance
functions that do not allow an analytical solution for the prototypes are usually
very inefficient. In the following we introduce a new fuzzy clustering approach
that also admits an analytical solution for the prototypes.

3 Clustering with Angle-Based Distances for Normalized 
Data 

The idea of our approach is very similar to the original neural network com- 
petitive learning approach as it is for instance described in M- Instead of the 
Euclidean distance between a class representative and a given datum that Koho- 
nen’s self organizing feature maps use, the simple competitive learning approach 
computes the dot product of these vectors. 

For normalized vectors the dot product is simply the cosine of the angle 
between the two vectors, i.e. the dot product is one if and only if the (normalized) 
vectors are identical, otherwise we obtain values between —1 and 1. Therefore, 
we define as the (modified) distance between a normalized prototype vector v 
and a normalized data vector x 

$$d^{2}(v, x) = 1 - v^{\top} x. \tag{6}$$

Thus we have $0 \le d^{2}(v, x) \le 2$ and, in case of normalized vectors,
$d^{2}(v, x) = 0 \Leftrightarrow x = v$.
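
The distance (6) is easy to check numerically. The following small Python
sketch (function names are our own) normalizes two vectors and evaluates the
dot product-based distance for identical, orthogonal and opposite directions.

```python
import math


def normalize(x):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]


def angle_distance(v, x):
    """Distance (6) between two unit vectors: 1 minus their dot product."""
    return 1.0 - sum(a * b for a, b in zip(v, x))


v = normalize([3.0, 4.0])
print(angle_distance(v, v))                        # identical vectors
print(angle_distance([1.0, 0.0], [0.0, 1.0]))      # orthogonal vectors
print(angle_distance([1.0, 0.0], [-1.0, 0.0]))     # opposite vectors
```

The three cases cover the lower end, the middle and the upper end of the range
of the distance.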

Let us for the moment assume that the data vectors are already normalized. 
How we actually carry out the normalization will be discussed later on. With 
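the distance function (6) the objective function (1) becomes

$$\sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} \left(1 - v_i^{\top} x_k\right)
  = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} \left(1 - \sum_{\ell=1}^{p} v_{i\ell} x_{k\ell}\right)$$

where $v_{i\ell}$ and $x_{k\ell}$ are the $\ell$th coordinates/components of the vectors $v_i$ and $x_k$,
respectively. By taking into account the constraint that the prototype vectors $v_i$ have
to be normalized, i.e.

$$\|v_i\|^{2} = \sum_{t=1}^{p} v_{it}^{2} = 1, \tag{7}$$

we obtain the Lagrange function

$$L = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} \left(1 - \sum_{\ell=1}^{p} v_{i\ell} x_{k\ell}\right)
  + \sum_{s=1}^{c} \lambda_s \left(\sum_{t=1}^{p} v_{st}^{2} - 1\right).$$

The partial derivative of $L$ w.r.t. $v_{i\ell}$ yields

$$\frac{\partial L}{\partial v_{i\ell}} = - \sum_{k=1}^{n} u_{ik}^{m} x_{k\ell} + 2 \lambda_i v_{i\ell}.$$

Since the first derivative has to be zero in a minimum, we obtain

$$v_{i\ell} = \frac{1}{2 \lambda_i} \sum_{k=1}^{n} u_{ik}^{m} x_{k\ell}. \tag{8}$$

Making use of the constraint (7), we have

$$1 = \frac{1}{4 \lambda_i^{2}} \sum_{\ell=1}^{p} \left( \sum_{k=1}^{n} u_{ik}^{m} x_{k\ell} \right)^{2},$$

which gives us

$$2 \lambda_i = \sqrt{\sum_{\ell=1}^{p} \left( \sum_{k=1}^{n} u_{ik}^{m} x_{k\ell} \right)^{2}} \tag{9}$$

so that we finally obtain

$$v_{i\ell} = \frac{\sum_{k=1}^{n} u_{ik}^{m} x_{k\ell}}{\sqrt{\sum_{t=1}^{p} \left( \sum_{k=1}^{n} u_{ik}^{m} x_{kt} \right)^{2}}} \tag{10}$$

as the updating rule for the prototypes.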
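
Read as a vector equation, the updating rule for the prototypes says that each
prototype is the weighted sum of the (normalized) data vectors, rescaled to unit
length. A minimal Python sketch of this step follows; the function name is our
own and the sketch assumes that the weighted sum is not the zero vector.

```python
import math


def update_normalized_prototype(memberships_i, data, m=2.0):
    """Prototype of cluster i: weighted sum of the data, scaled to unit length.

    memberships_i[k] is u_ik and data[k] is the normalized data vector x_k.
    """
    p = len(data[0])
    weighted_sum = [
        sum((u ** m) * x[t] for u, x in zip(memberships_i, data))
        for t in range(p)
    ]
    norm = math.sqrt(sum(s * s for s in weighted_sum))
    return [s / norm for s in weighted_sum]


data = [(1.0, 0.0), (0.0, 1.0)]
print(update_normalized_prototype([1.0, 1.0], data))
```

Two unit vectors with equal membership yield the unit vector on their bisector.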







Fig. 1. Normalization of a datum 



For this formula we have assumed that the data vectors are normalized.
When we simply normalize the data vectors, we lose information, since collinear
vectors are mapped to the same normalized vector. In order to avoid this effect
we extend our data vectors by one component which we set to one for all data
vectors and normalize these $(p+1)$-dimensional data vectors. In this way, the
data vectors in $\mathbb{R}^p$ are mapped to the upper half of the unit sphere in $\mathbb{R}^{p+1}$.
Figure 1 illustrates the normalization for one-dimensional data.

Figure 2 shows a clustering result for a two-dimensional data set (i.e. the
clustering is actually carried out on the normalized three-dimensional data). The
membership degrees are not illustrated in the figure. We have connected each
datum with the cluster centre (that we obtain by reversing the normalization
procedure) to which it has the highest membership degree.

Fig. 2. Two clusters
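
The normalization step and its reversal can be sketched as follows. The function
names are our own, and reversing the normalization by dividing by the last
component is our reading of the construction (the last component is positive for
every mapped datum), not a formula quoted from the text.

```python
import math


def extend_and_normalize(x):
    """Append a component 1 to x and scale the result to unit length."""
    extended = list(x) + [1.0]
    norm = math.sqrt(sum(a * a for a in extended))
    return [a / norm for a in extended]


def reverse_normalization(y):
    """Recover the original vector: divide by the last component, then drop it."""
    return [a / y[-1] for a in y[:-1]]


def angle_distance(v, x):
    return 1.0 - sum(a * b for a, b in zip(v, x))


# One-dimensional data: values near zero stay apart, large values move together.
near = angle_distance(extend_and_normalize([0.0]), extend_and_normalize([1.0]))
far = angle_distance(extend_and_normalize([100.0]), extend_and_normalize([101.0]))
print(near, far)
```

The two printed distances belong to pairs with the same Euclidean distance of
one, which illustrates how the mapping compresses long vectors.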




Fig. 3. The one-dimensional distance function 



It can be seen in figure 2 that the prototype of the upper cluster is slightly
lower than one might expect. The reason is that the distance function is not affine
invariant. We can already see in figure 1 that vectors near zero keep almost their
Euclidean distance when we normalize them, whereas very long vectors are all
mapped to the very lower part of the semi-circle.

Figure 3 shows distance values of two one-dimensional vectors. (The distance
is computed for the normalized two-dimensional vectors.) Of course, the distance
is zero at the diagonal and increases when we go away from the diagonal. But
the distance increases very quickly with the distance to the diagonal near
zero, whereas it increases slowly when we are far away from the diagonal.




Fig. 4. Distance to the point (0, 0) 



Figures 4 and 5 also illustrate this effect. In Figure 4 the distance to the
(non-normalized) two-dimensional vector (cluster centre) $(0,0)^{\top}$ is shown. It is
a symmetrical distance function. However, when we replace the cluster centre
$(0,0)^{\top}$ by the vector $(1,0)^{\top}$, we obtain the function in figure 5.

Here we can see that the distance is asymmetrical in the sense that it increases
faster when we look in the direction of $(0,0)^{\top}$. This can be an undesired effect
for certain data sets. But there are also data sets for which this effect has a
positive influence on the clustering result. Consider for instance data vectors
with the annual salary of a person as one component. When we simply normalize
each component, the effect is that a few outliers (persons with a very high
income) force almost all data to be normalized to values very near zero.
This means that the great majority simply collapses to one cluster (near zero)
and few outliers build single clusters. Instead of a standard normalization, we
can also choose a logarithmic scale in order to avoid this effect. But the
above-mentioned clustering approach offers an interesting alternative.

Fig. 5. Distance to the point (1, 0)

Figure 6 shows a clustering result of data of bank customers with the
attributes age, income, amount in depot, credit, and guarantees for credits. The
number of clusters was automatically determined by a validity criterion,
resulting in three clusters. The axes shown in the figure are credit, income, and
amount in depot. It is worth noticing that there is a compact cluster in the
centre, representing the majority of average customers, whereas there are two
other clusters covering customers with high credit or a large amount of money,
respectively.



4 Clustering with Angle-Based Distances for 
Non-normalized Data 

In the previous section we have assumed that the data vectors are normalized 
or that we normalize them for the clustering. In this section we discuss what 
happens, when we refrain from normalizing the data vectors and the cluster 
centres. In order to avoid negative distances, we have to modify the distance 
function to 

$$d^{2}(v, x) = \left(1 - v^{\top} x\right)^{2}. \tag{11}$$

The geometrical meaning of this distance function is the following. A datum
$x$ has distance zero to the cluster $v$, if and only if $v^{\top} x = 1$ holds. This equation
describes a hyperplane, i.e. the hyperplane of all $x \in \mathbb{R}^p$ of the form

$$x = \frac{v}{\|v\|^{2}} + \sum_{s=1}^{p-1} \lambda_s w_s \tag{12}$$

where the vectors $w_1, \ldots, w_{p-1} \in \mathbb{R}^p$ span the hyperplane perpendicular to $v$
and $\lambda_1, \ldots, \lambda_{p-1} \in \mathbb{R}$.

This means that we can find clusters in the form of linear varieties like the
FCV algorithm. We will return to a comparison of FCV and this approach later
on. Figure 7 shows the distance to the prototype $v^{\top} = (0.5, 0)$. This prototype
describes the line $x_1 = 2$, i.e. all points $x = (2, 0)^{\top} + \lambda (0, 1)^{\top}$ with $\lambda \in \mathbb{R}$.
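
A short numerical check of this geometry, as a Python sketch with a function
name of our own choosing: for the prototype (0.5, 0) every point whose first
coordinate equals 2 has distance zero, while the origin has distance one.

```python
def hyperplane_distance(v, x):
    """Distance (11): squared deviation of the dot product v^T x from 1."""
    return (1.0 - sum(a * b for a, b in zip(v, x))) ** 2


v = (0.5, 0.0)
for point in [(2.0, 0.0), (2.0, 5.0), (2.0, -3.0), (0.0, 0.0), (4.0, 7.0)]:
    print(point, hyperplane_distance(v, point))
```

Note that the second coordinate has no influence on the distance for this
prototype, since the line is parallel to the second axis.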







Fig. 6. Clustering result for bank customers 



Fig. 7. Distance to (0.5, 0)

In order to derive equations for the prototypes we insert the distance function
(11) into the objective function (1) and take the first derivative w.r.t. $v_{i\ell}$:

$$\frac{\partial}{\partial v_{i\ell}} \sum_{j=1}^{c} \sum_{k=1}^{n} u_{jk}^{m} \left(1 - v_j^{\top} x_k\right)^{2}
  = -2 \sum_{k=1}^{n} u_{ik}^{m} \left(1 - v_i^{\top} x_k\right) x_{k\ell}.$$

These derivatives have to be zero at a minimum and we obtain the system of
linear equations

$$\sum_{k=1}^{n} u_{ik}^{m} \left(1 - v_i^{\top} x_k\right) x_k = 0.$$

Making use of the fact that $(v_i^{\top} x_k) x_k = (x_k x_k^{\top}) v_i$ holds, we obtain for the
prototypes

$$v_i = \left( \sum_{k=1}^{n} u_{ik}^{m} x_k x_k^{\top} \right)^{-1} \sum_{k=1}^{n} u_{ik}^{m} x_k.$$

Note that the matrix $\sum_{k=1}^{n} u_{ik}^{m} x_k x_k^{\top}$ is the (weighted) covariance matrix
(assuming mean value zero) and can therefore be inverted unless the data are
degenerate.

An example of the detection of two linear clusters is shown in figure 8.

The difference of this approach to the FCV algorithm is in the computing
scheme that requires inverting a matrix whereas for the FCV algorithm all
eigenvalues and eigenvectors have to be computed. Another difference is caused by
the non-Euclidean distance function that is again not affine invariant. Problems
can arise when lines are near to $(0,0)^{\top}$, since then the corresponding prototype
vector $v$ is very large, and even small deviations from the linear cluster lead to
large distances. These problems are well known for other fuzzy clustering
algorithms with non-Euclidean distance functions CD and have to be treated in a
similar way.
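
The closed-form prototype computation of this section amounts to solving one
small linear system per cluster. The following Python sketch illustrates this
with plain Gaussian elimination instead of an explicit matrix inverse; the
function names and the singularity threshold are our own assumptions.

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular: the data are degenerate")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def linear_prototype(memberships_i, data, m=2.0):
    """Prototype v_i solving (sum_k u_ik^m x_k x_k^T) v_i = sum_k u_ik^m x_k."""
    p = len(data[0])
    matrix = [[0.0] * p for _ in range(p)]
    rhs = [0.0] * p
    for u, x in zip(memberships_i, data):
        w = u ** m
        for s in range(p):
            rhs[s] += w * x[s]
            for t in range(p):
                matrix[s][t] += w * x[s] * x[t]
    return solve_linear_system(matrix, rhs)


# Four points on the line x_1 = 2, all fully assigned to the cluster.
points = [(2.0, -1.0), (2.0, 0.0), (2.0, 1.0), (2.0, 3.0)]
print(linear_prototype([1.0] * 4, points))
```

For points lying exactly on a line that does not pass through the origin, the
computed prototype reproduces that line.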

5 Conclusions 

We have introduced fuzzy clustering algorithms using dot product-based distance 
functions that lead to new cluster shapes in the normalized case and to linear 
clusters in the non-normalized case. They represent a further extension of the 
already known objective function-based fuzzy clustering approaches. 







Fig. 8. Two linear clusters 
Abstract. The paper deals with clustering of objects described both by
properties and relations. Relational attributes may make object descrip- 
tions recursively depend on themselves so that attribute values cannot 
be compared before objects themselves are. An approach to clustering is 
presented whose core element is an object dissimilarity measure. All sorts 
of object attributes are compared in a uniform manner with possible ex- 
ploration of the existing taxonomic knowledge. Dissimilarity values for 
mutually dependent object couples are computed as solutions of a sys- 
tem of linear equations. An example of building classes on objects with 
self-references demonstrates the advantages of the suggested approach. 



1 Introduction 

Object-based systems provide a variety of tools for building software models of
real-world domains: classes and inheritance, object composition, abstract data
types, etc. As a result, the underlying data model admits highly structured
descriptions of complex real-world entities. With the number of object applications
constantly growing, the need for analysis tools for object datasets becomes critical
0. Although the importance of the topic has been recognized mm, it has been
rarely addressed in the literature: a few studies concerning object knowledge
representation (KR) systems |31I3P, or object-oriented (OO) databases 0 have
been reported in the past years.

Our own concern is the design of automatic class building tools for objects
with complex relational structure. A compact class can be built on top of an
object cluster discovered by an automatic clustering procedure 0. The main
difficulty that a clustering application faces is the definition of a consistent
comparison for all kinds of object attributes. Relations connect objects into larger
structures, object networks, and may lead to (indirect) self-references in object
descriptions. Self-references make it impossible to always compare object
attributes before comparing objects. Consequently, most of the existing approaches
to clustering pni!l cannot be applied to self-referencing descriptions.

We suggest an approach towards automatic class building whose core element
is an object dissimilarity measure. The measure evaluates object attributes in
a uniform way. Its values on mutually dependent object couples are computed
as solutions of a system of linear equations. Thus, even object sets with
self-references may be processed by a clustering tool. The discovered clusters are
turned into object classes and then provided with an intensional description.

The paper starts with a short presentation of an object model together with
a discussion of aspects which make objects similar (Section 2). A definition of
a suitable dissimilarity measure is given in Section 3. The computation of the
measure on mutually dependent object couples is presented. Section 4 shows an
example of an application of the measure. Clustering and class characterization
are discussed in Section 5.

2 Object Formalism 

Object languages organize knowledge about a domain around two kinds of enti- 
ties, classes and objects. Domain individuals are represented as objects, whereas 
groups of individuals, or categories, give rise to classes. Thus, each class is as- 
sociated to a set of member objects, its instances. Both classes and objects are 
described in terms of attributes which capture particular aspects of the under- 
lying individuals. A class description is a summary of instance descriptions. 

In the following, only the descriptive aspects of object languages will be 
considered. The object model of Tropes, a system developed in our team 0 will 
be used to illustrate the object specific structure. However, the results presented 
later in the paper hold for a much larger set of object models. 



2.1 Objects, Concepts, and Classes 

A structured object in our model is a list of attribute values which are themselves
objects or simple values. For instance, an object representing a flat within a
real-estate knowledge base may have fields for the flat’s rent, standing, owner, etc.
Fig. 1 shows an example of an object, flat#5, with three attributes, rooms, owner
and rent.

Three kinds of attributes will be distinguished: properties, components and
links. Properties model features of the individual being modeled, for example
the rent of a flat, whereas components and links model relations to other
individuals. The composition, or part-of, relation between individuals is expressed
through component attributes. Such an attribute relates a composite object to
one or a collection of component objects. Composition is distinguished due to
its particular nature: it is a transitive and non-circular relation. In the above
example, owner is a link and rooms is a component. Apart from their nature,
attributes have a type which delimits their possible values. Object-valued
attributes are typed by object concepts whereas simple values are members of data
types, further called abstract data types (ADT). Finally, attributes may have a
single value or a collection of values, in which case they are called multi-valued
attributes. Collections are built on a basic type by means of a constructor, list
or set. For instance, the value of the rooms attribute in flat#5 is a set of three
instances of the room concept.
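
To fix ideas, the attribute structure described above can be mimicked with a few
Python data classes. This is only an illustrative sketch and not the Tropes
representation: the class and field names, the ADT name "integer", the
identifiers human#1 and room#1 to room#3, and the rent value are all
hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class AttributeSpec:
    """Concept-level description of one attribute."""
    nature: str       # "property", "component" or "link"
    type_name: str    # ADT name for properties, concept name for relations
    constructor: str  # "single", "set" or "list"


@dataclass(eq=False)  # objects have identity: no value-based equality
class KBObject:
    concept: str
    identifier: str
    values: Dict[str, Any] = field(default_factory=dict)


# The flat concept with the three attributes of the running example.
FLAT = {
    "rooms": AttributeSpec("component", "room", "set"),
    "owner": AttributeSpec("link", "human", "single"),
    "rent": AttributeSpec("property", "integer", "single"),  # ADT name assumed
}


def relational_attributes(concept):
    """Names of the attributes that relate an object to other objects."""
    return sorted(name for name, spec in concept.items()
                  if spec.nature in ("component", "link"))


owner = KBObject("human", "human#1")
rooms = frozenset(KBObject("room", f"room#{i}") for i in (1, 2, 3))
flat5 = KBObject("flat", "flat#5", {"rooms": rooms, "owner": owner, "rent": 2500})
print(relational_attributes(FLAT), len(flat5.values["rooms"]))
```

Disabling value-based equality keeps two objects with identical descriptions
distinct, which mirrors the statement that objects, unlike simple values, have
an identity.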
Fig. 1. Example of a concept, flat, a class, high-standing and an instance,
flat#5. All three entities provide lists of attributes: concept attributes specify
type, nature and constructor, which are necessary in interpreting values in
instance attributes; class attributes provide value restrictions on instance
attributes; instance attributes contain values.



Objects of a KB are divided into disjoint families, called concepts, e.g. flat,
human, room. An object is thus an instance of a unique concept, for example,
flat#5 of the flat concept drawn as a rectangle. Concepts define the structure
(set of object attributes with their types) and the identity of their instances. This
may be seen in the upper right part of Fig. 1: flat defines the type, the nature
and the constructor of the three attributes of the example. Concepts are
comparable to big classes in a traditional OO application, or to tables in relational
databases. A class defines a subset of concept instances called class members.
Classes provide value restrictions on attribute values of member objects, that is
sets of admissible values among all the values in the attribute basic type given
by the concept. Restrictions on property attributes correspond to subtypes of
the ADT, whereas restrictions on relations are classes of the underlying concept.
For example, the class high-standing in Fig. 1 restricts the rent attribute to
values in the interval [2200, 4000] and the rooms attribute to sets composed of
members of the classes basic and service. Classes of a concept are organized
into a hierarchical structure called taxonomy.

An object KB is made up of a set of concepts. For each concept, a set of instances
is given which are organized into one or more class taxonomies. Objects and
classes of different concepts are connected by means of relational attributes.
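
The value restrictions of a class can be read as a membership test. The
following Python sketch checks the high-standing restrictions quoted above; it
is a simplification under our own assumptions, in particular each room is
represented only by the name of the class it belongs to, and the function and
variable names are hypothetical.

```python
def satisfies(values, restrictions):
    """True if every restricted attribute has an admissible value."""
    return all(admissible(values[name])
               for name, admissible in restrictions.items())


# Restrictions of the class high-standing from the running example.
HIGH_STANDING = {
    "rent": lambda rent: 2200 <= rent <= 4000,
    "rooms": lambda room_classes: set(room_classes) <= {"basic", "service"},
}

candidate = {"rent": 2500, "rooms": ["basic", "basic", "service"]}
print(satisfies(candidate, HIGH_STANDING))
```

An object whose rent lies outside the interval, or which has a room of another
class, fails the test.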



2.2 Summary on Objects as Data Model 

An object is a member of the entire set of instances defined by its concept. 
A class defines a sub-set of instances. In quite the same way, simple values are 
members of the domain of their ADT whereas types define subsets of ADT values. 
Unlike objects, simple values have neither identity, nor attributes. 
Objects can be seen as points in a multi-dimensional space determined by
the object’s concept where each dimension corresponds to an attribute. A class
describes a region of that space and each member object lies exclusively within
the region. Furthermore, dimensions corresponding to relational attributes are
spaces themselves. Thus, a relation establishes a dependency between its source
concept and its target concept: instances of the former are described by means
of instances of the latter. An overview of all inter-concept dependencies is
provided by a graphical structure, henceforth called conceptual scheme of the KB,
composed of concepts as vertices and relational attributes as (labeled) edges.
For example, the (partial) conceptual scheme of the real estate KB given on the
left of Fig. 2 is made up of three concepts, human, flat and room, and three
attributes: two links composing a circuit, owner and house, and a multi-valued
component relation, rooms.

The scheme summarizes the relations that may exist between instances of 
different concepts in the KB. In fact, each object is embedded into a similar 
relational structure where each concept is replaced by one or a collection of its 
instances. The structure, which we call the network associated to an object o, 
includes all objects which are related to o by a chain of attributes. In a network, 
an edge corresponding to a multi-valued attribute may link a source object to 
a set of target objects. For example, on the left of Fig. 2 the network of the
object flat#5 is drawn. Within the network, an edge rooms connects flat#5
to the three objects representing the flat’s rooms. Observe the similar topologies
of both structures in Fig. 2.

Finally, the network of o extended with classes of all its objects contains the 
entire amount of information about o. It is thus the maximal part of the KB to 
be explored for proximity computation. 
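
The network associated to an object can be collected by a breadth-first
traversal over relational attributes, where a set of visited objects takes care
of circuits such as owner and house. The sketch below is illustrative only: the
dictionary encoding of the KB, the function name, and all identifiers other than
flat#5 are our own assumptions, and the start object is counted as part of its
network.

```python
from collections import deque


def object_network(start, relations):
    """All objects reachable from start by a chain of relational attributes.

    relations maps an object identifier to a dict {attribute: [target ids]}.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for targets in relations.get(current, {}).values():
            for target in targets:
                if target not in seen:  # guards against circuits
                    seen.add(target)
                    queue.append(target)
    return seen


kb = {
    "flat#5": {"rooms": ["room#1", "room#2", "room#3"], "owner": ["human#1"]},
    "human#1": {"house": ["flat#5"]},  # closes the owner/house circuit
}
print(sorted(object_network("flat#5", kb)))
```

Because of the circuit, the networks of flat#5 and of its owner contain the same
objects.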






Fig. 2. Example of a KB conceptual scheme, on the right, and an object network 
that reflects the scheme structure, on the left. The network is obtained from the 
scheme by replacing a concept by an instance or a collection of instances. 



^ Actually, there is a label-preserving morphism from each instance network to the 
respective part of the conceptual scheme. 
2.3 Analysis Issues

We are concerned with class design from a set of unclassified objects within an 
object KB. The problem to solve is an instance of the conceptual clustering
problem and therefore may be divided into two sub-tasks: constitution of member
sets for each class and class characterization. A conventional conceptual
clustering algorithm would typically carry out both sub-tasks simultaneously using
class descriptions to build member sets. With an object dataset where link at- 
tributes relate objects of the same concept this approach may fail. Let’s consider 
the example of the spouse attribute which relates objects representing a mar- 
ried couple. Within a class of human, the spouse attribute refers to a possibly 
different class of the same concept. The attribute induces two-object circuits on 
instances of human and therefore may lead to circular references between two 
classes in the taxonomy. When classes are to be built in an automatic way, it is 
impossible to evaluate two classes, say c and c', referring to each other since the 
evaluation of each class would require, via the spouse attribute, the evaluation 
of the other one. 

We suggest an approach to automatic class building which deals with both
tasks separately. First, member sets are built by an automatic clustering
procedure based on an object proximity measure. The measure implements
previously suggested principles for comparing object descriptions with
self-references. Once all class member sets are available, these are turned
into undescribed classes. The characterization step is then straightforward
since class attributes, in particular those establishing circuits, may refer to
any of the existing classes.

As our aim is to find compact classes, we require the discovered clusters to be
homogeneous on all object attributes, including relational ones. Thus, an object
proximity measure is to be designed which combines in a consistent way the
differences on both properties and relations. Keeping in mind the analogy
between objects and simple values, we can formulate the following general
comparison principle. Given two points in a space, we shall measure their mutual
difference with respect to the relative size of the smallest region that covers
both of them. The smallest region is a type in case of primitive values and a
class in case of objects. It is unique for an ADT, but it may be interpreted in
two different ways for concepts. In a first interpretation, it corresponds to an
existing class, the most specific common class of both objects. Thus, the
corresponding region is not necessarily the smallest possible one, but rather
the smallest admissible by the taxonomy of the concept. The second
interpretation is straightforward: the region is the effective smallest one, and
is thus independent from the existing taxonomy.

In the next section we describe a dissimilarity function based on the above 
principle which deals successfully with circularity in object descriptions. 

3 Dissimilarity of Objects 

A proximity measure, here of the dissimilarity kind, is usually defined over a
set $\Omega$ of individuals $\omega_j$ described by $n$ attributes
$\{a_i\}_{i=1}^{n}$. Attribute-level functions $\delta_i$ compute elementary
differences on each attribute $a_i$, whereas a global function
$d = Aggr(\delta_i)$ combines those differences into a single value. In case of
an object KB, each concept $C_l$ is assigned a separate object set $\Omega_l$
and hence a specific dissimilarity function.

3.1 Object-Level Function 

Let C be a concept, e.g. flat, with n attributes $\{a_1, \ldots, a_n\}$ (we
assume that each object has all attribute values). In our model, the object
dissimilarity function $d^o_C : C^2 \to \mathbb{R}$ is a normalized linear
combination of the attribute dissimilarities
$\delta_i : type(a_i)^2 \to \mathbb{R}$. For a couple of objects $o$ and $o'$
in C, dissimilarity is computed as:

$$d^o_C(o, o') = \sum_{i=1}^{n} \lambda_i \, \delta_i(o.a_i, o'.a_i) . \qquad (1)$$

where $\lambda_i$ is the weight of attribute $a_i$
($\sum_{i=1}^{n} \lambda_i = 1$). It is noteworthy that all real-valued
functions are normalized.

With respect to what has been said in the previous section, the value of
$d^o_C(o, o')$ may be interpreted in the following way. Suppose each $\delta_i$
evaluates the size of the smallest region that contains the values $o.a_i$ and
$o'.a_i$ within the corresponding dimension. Then, $d^o_C(o, o')$ computes the
weighted average of all such sizes. This is an estimation of the size of the
smallest region covering $o$ and $o'$ within the space of C. The form of $d^o$
has been chosen to enable the computation in case of circularity. In the
following, possible definitions for $\delta_i$ are presented which are
consistent with the general principle of the previous section.
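
The weighted combination of Formula 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the attribute names (area, floor), their ranges and the weights are hypothetical, and every attribute is treated as a numeric property whose elementary difference is the absolute difference divided by the range width.

```python
from typing import Callable, Dict


def interval_delta(low: float, high: float) -> Callable[[float, float], float]:
    """Build a normalized difference for a numeric attribute ranging over [low, high]."""
    width = high - low
    return lambda v, w: abs(v - w) / width


def object_dissimilarity(o: Dict[str, float], o2: Dict[str, float],
                         weights: Dict[str, float],
                         deltas: Dict[str, Callable[[float, float], float]]) -> float:
    """Formula 1: weighted sum of attribute dissimilarities (weights sum to 1)."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * deltas[a](o[a], o2[a]) for a, w in weights.items())


# Hypothetical flats described by two property attributes.
deltas = {"area": interval_delta(0.0, 100.0), "floor": interval_delta(0.0, 10.0)}
weights = {"area": 0.7, "floor": 0.3}
flat_a = {"area": 60.0, "floor": 2.0}
flat_b = {"area": 90.0, "floor": 4.0}
d_ab = object_dissimilarity(flat_a, flat_b, weights, deltas)
```

Because the weights sum to one and each elementary difference lies in [0, 1], the combined value also stays in [0, 1], which is what makes the recursive use of the function on object-valued attributes possible.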



3.2 Attribute-Level Functions 

Following the attribute type, $\delta_i$ is substituted in Formula 1 either by
a property dissimilarity, $\delta^T$, or by a relational dissimilarity.

The dissimilarity of simple values depends on the ratio between the size
of their most specific common type and the size of the whole ADT value set.
Formally, let T be an ADT of domain D. We shall denote by range(t) the size of
the subset of D described by the type t. This value may be a set cardinality,
when T is a nominal or partially ordered type, or an interval length, when T is
a totally ordered type. For a couple of values in D, $v$ and $v'$, let
$v \vee v'$ denote their most specific common type within T. Then:

$$\delta^T(v, v') = \frac{range(v \vee v') - 0.5 * (range(v) + range(v'))}{range(D)} \qquad (2)$$

Subtracting the average of the value ranges in the above formula keeps the
function consistent with conventional functions on standard types. For
example, with an integer type T = [0, 10] the value of $\delta^T(2, 4)$ is 0.2.
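
As a small check of Formula 2, the following Python sketch reproduces the integer example above. It assumes, as the example suggests, that a single value of a totally ordered type has range 0 and that the most specific common type of two values is the interval between them; the function names are illustrative only.

```python
def property_dissimilarity(range_common, range_v, range_v2, range_domain):
    """Formula 2: relative size of the smallest common type, corrected by the
    average size of the two compared values."""
    return (range_common - 0.5 * (range_v + range_v2)) / range_domain


def interval_dissimilarity(v, v2, low, high):
    """Formula 2 on a totally ordered type [low, high].

    Assumption: a single value has range 0 and the most specific common type
    of v and v2 is the interval between them.
    """
    return property_dissimilarity(abs(v - v2), 0.0, 0.0, high - low)


# The example from the text: T = [0, 10], values 2 and 4.
example = interval_dissimilarity(2, 4, 0, 10)
```

Under these assumptions the formula reduces to the usual normalized absolute difference, which is the consistency with conventional functions that the text refers to.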

In case of an object-valued attribute, the relative size of the effective
smallest region is estimated by applying a function of the $d^o$ kind. Doing
so, the dissimilarity computation goes a step further in the object network
structure, from objects to attribute values. In case of a strongly connected
component^ of the network, the computation comes back to the initial couple of
objects, o and o'. For example, comparing two flats requires the comparison of
their owners, which in turn requires the comparison of flats. In other words,
the self-references in object structure lead to recursive dependence between
dissimilarity values for object couples. Section 3.3 presents a possible way to
deal with recursion.

The class dissimilarity $\delta^{cl}$ compares objects as members of a class
and no longer as attribute lists. More precisely, $\delta^{cl}$ estimates the
relative size of the most specific common class of two objects. Of course, the
function can only be used on an attribute a if a class taxonomy is available on
C' = type(a). Moreover, the computed values only make sense if the taxonomy
structure reflects the similarities between instances of C'.

Formally, let C' be a concept, the type of an object-valued attribute a, and
let root(C') be the root class of C'. Let also $o'_1$, $o'_2$ be a couple of
instances of C' and let $c = o'_1 \vee o'_2$ be the most specific common class
of $o'_1$, $o'_2$. Class dissimilarity is then the ratio of the number of
objects in the class and the total number of concept instances. Thus, the more
specific the class, the less dissimilar the objects.

$$\delta^{cl}(o'_1, o'_2) = \frac{\|members(c)\| - 1}{\|members(root(C'))\|} \qquad (3)$$

where members() returns the set of member objects of a class. Here the
subtracted one stands for the average of members() on both objects.

The above function allows the existing taxonomic knowledge in the KB to 
be used for clustering, that is for building of new taxonomies. 
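
A minimal Python sketch of Formula 3 follows. The taxonomy, the class names and the flat identifiers are hypothetical, and the sketch assumes a tree-shaped taxonomy in which the member set of a class includes the members of its subclasses, so that the smallest class containing both objects is their most specific common class.

```python
def most_specific_common_class(taxonomy, o1, o2):
    """Return the name of the smallest class whose member set holds both objects."""
    candidates = [name for name, members in taxonomy.items()
                  if o1 in members and o2 in members]
    return min(candidates, key=lambda name: len(taxonomy[name]))


def class_dissimilarity(taxonomy, root, o1, o2):
    """Formula 3: (|members(c)| - 1) / |members(root)|."""
    c = most_specific_common_class(taxonomy, o1, o2)
    return (len(taxonomy[c]) - 1) / len(taxonomy[root])


# Hypothetical taxonomy of ten flats; member sets include subclass members.
flats = {f"flat#{i}" for i in range(1, 11)}
taxonomy = {
    "flat": flats,
    "studio": {"flat#1", "flat#2"},
    "family": {"flat#3", "flat#4", "flat#5", "flat#6"},
}
same_small_class = class_dissimilarity(taxonomy, "flat", "flat#1", "flat#2")
only_root_in_common = class_dissimilarity(taxonomy, "flat", "flat#1", "flat#3")
```

Two flats sharing a small class come out close, while two flats that only share the root class come out almost maximally dissimilar.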

In case of multi-valued attributes, both relational and of property nature, a 
specific comparison strategy must be applied since collections may have variable 
length. The resulting collection dissimilarity relies on a pair-wise matching of 
collection members prior to the computation. Space limitations do not allow 
the point to be extended here, but an interested reader will find a description 
of a multi-valued dissimilarity in El- All functions defined in that paper are 
consistent with the above dissimilarity model in that they are normalized and 
represent valid dissimilarity indices. 



3.3 Dealing with Circularity 

Circularity arises when strongly connected components occur in object networks.
A simple example of such a component is the two-way dependency between the flat
and human concepts established by the owner and house attributes. If the
dissimilarity of a couple of humans who own their residences is to be computed,
this depends, via the house attribute, on the dissimilarity of the respective
flats. The flat dissimilarity depends, in turn, on the dissimilarity of the
initial objects via owner. Both values depend recursively on themselves. A
possible way to deal with such a deadlock is to compute the values as solutions
of a system of linear equations (see [2]).

^ Not to be confused with component attributes.
A single system is composed for each strongly connected component that
occurs in both networks. In the system, the variables $x_i$ correspond to pairs
of objects which may be reached from the initial pair o, o' by the same
sequences of relational links (no composition). For each couple, let's say
there are m couples, an equation is obtained from Formula 1:

$$x_i = b_i + \sum_{j=1, j \neq i}^{m} c_{i,j} \, x_j . \qquad (4)$$

Here, $b_i$ is the local part of the dissimilarity, that is, the sum of all
dissimilarities on properties and components plus those on links which do not
appear in the strongly connected component (so there are no variables
representing them in the system). The remaining dissimilarities are on link
attributes which take part in the circular dependence.

The coefficients Cij in each equation are computed as follows. If the objects 
of a couple corresponding to xj are the respective values of an attribute a in the 
objects of the Xi couple, then Cij is the weight of the attribute a. Otherwise, Cij 
is 0. 

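The obtained system is square (m variables and m equations):

$$C * X = B . \qquad (5)$$

Written with all unknowns on the left-hand side, Formula 4 gives C a unit
diagonal and off-diagonal entries $-c_{i,j}$. The matrix C is diagonally
dominant, so the system has a unique solution. It can be computed in a direct
way or by an iterative method.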
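
The iterative option can be illustrated with a simple fixed-point (Jacobi-style) iteration on Formula 4. The sketch below is not taken from the paper: the function name and the two-equation example with coefficients 0.5 are hypothetical, and convergence relies on the assumption that the coefficients in each row sum to less than one, which is the diagonal dominance mentioned above.

```python
def solve_fixed_point(b, c, tol=1e-12, max_iter=10_000):
    """Iterate x_i <- b_i + sum_{j != i} c[i][j] * x_j until the update is tiny.

    Converges when each row of c sums (in absolute value) to less than 1.
    """
    m = len(b)
    x = list(b)
    for _ in range(max_iter):
        new_x = [b[i] + sum(c[i][j] * x[j] for j in range(m) if j != i)
                 for i in range(m)]
        if max(abs(p - q) for p, q in zip(new_x, x)) < tol:
            return new_x
        x = new_x
    raise RuntimeError("iteration did not converge")


# Hypothetical two-pair component: x1 = 0.2 + 0.5*x2, x2 = 0.1 + 0.5*x1.
b = [0.2, 0.1]
c = [[0.0, 0.5], [0.5, 0.0]]
x = solve_fixed_point(b, c)
```

For small components a direct method would do equally well; the iteration is shown only because it mirrors the recursive definition of the measure.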

The measures $d^o_C$, one per concept C, obtained by such a computation are
valid dissimilarity indices, since they are positive, symmetric and minimal.
Moreover, it can be proved that if all $\delta_i$ functions are metrics, then
the $d^o_C$ are metrics too.

4 An Example of Dissimilarity Computation 

In the following, we shall exemplify the way $d^o$ is computed and used to
cluster objects. Due to space limitations, we only consider a sample dataset
made exclusively of human instances with three attributes: age, salary and
spouse (see Table 1). The spouse attribute is of a link nature and its type is
the human concept itself. It thus establishes tiny cycles of two instances of
the human concept. The salary attribute indicates the monthly income of a
person in thousands of euros; it is of float type and ranges in [1.8, 3.3],
whereas the age attribute is of integer type and ranges in [20, 35].

Let's now see how the value of $d^o$ is computed for a couple of objects of the
set, say o1 and o4. First, we fix the attribute weights at 0.4 for spouse, and
0.3 for both salary and age. According to Formula 1, the dissimilarity for o1
and o4 becomes:

$$d^o_{human}(o1, o4) = 0.3\,\delta_{age}(26, 30) + 0.3\,\delta_{salary}(3, 2.7) + 0.4\,\delta_{spouse}(o2, o3) . \qquad (6)$$
Table 1. A sample dataset of human instances

attribute | o1   o2   o3   o4   o5   o6
----------+-----------------------------
age       | 26   23   27   30   32   35
salary    | 3    2.3  2.2  2.7  2.9  3.2
spouse    | o2   o1   o4   o3   o6   o5

The first two differences are computed by a property function taking into
account the respective domain ranges. The total of both computations amounts
to 0.18. As no taxonomy is provided, the dissimilarity on spouse is replaced by
$d^o_{human}$:

$$d^o_{human}(o2, o3) = 0.3\,\delta_{age}(23, 27) + 0.3\,\delta_{salary}(2.3, 2.2) + 0.4\,d^o_{human}(o1, o4) . \qquad (7)$$

Here we come to the mutual dependency of $d^o_{human}(o2, o3)$ and
$d^o_{human}(o1, o4)$ due to the spouse circuit. Substituting a variable for
each of them, let's say $x_1$ for $d^o_{human}(o1, o4)$ and $x_2$ for
$d^o_{human}(o2, o3)$, the following linear equation system is obtained:

$$x_1 = 0.18 + 0.4\,x_2 \qquad (8)$$

$$x_2 = 0.14 + 0.4\,x_1 \qquad (9)$$

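Table 2. Dissimilarity values for $d^o_{human}$ (below the main diagonal) and
for the simple measure on age and salary (above the main diagonal)

     | o1    o2    o3    o4    o5    o6
-----+-----------------------------------
o1   | 0     0.38  0.32  0.3   0.33  0.51
o2   | 0.38  0     0.23  0.48  0.65  0.9
o3   | 0.36  0.26  0     0.31  0.48  0.73
o4   | 0.28  0.43  0.31  0     0.17  0.42
o5   | 0.5   0.61  0.46  0.33  0     0.25
o6   | 0.56  0.74  0.57  0.44  0.25  0
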

The solutions for $x_1$ and $x_2$ are 0.28 and 0.26 respectively. The values of
$d^o_{human}$ for the whole dataset are given in Table 2. The table should be
read as follows: whereas the entries below the main diagonal represent the
results of the above function, another function, which we call the simple
measure, is given in the upper part of the matrix. The simple measure is
computed only on the age and salary attributes taken with equal weights. We put
it here to exemplify the specific features of the relational measure in case of
circularity.
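
For the two-equation system of Equations 8 and 9, the solution can be worked out by direct substitution. The derivation below uses the rounded constants 0.18 and 0.14 as printed; the results are close to the reported values, and the small gap on $x_2$ presumably reflects rounding in the printed constants.

```latex
\begin{align*}
x_1 &= 0.18 + 0.4\,(0.14 + 0.4\,x_1) = 0.236 + 0.16\,x_1 \\
0.84\,x_1 &= 0.236 \quad\Rightarrow\quad x_1 \approx 0.281 \\
x_2 &= 0.14 + 0.4 \cdot 0.281 \approx 0.252
\end{align*}
```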
A detailed examination of respective values for and leads to 

the following observation. Given a couple of objects o' and o" which represent 
a couple of spouses, lets consider an arbitrary third object o and its dissimi- 
larity with each of the initial objects. Suppose also is less than 

^uraanW' ,o). Then, the following inequalities hold : 

—humani'^ ’ ^ ^ —human {o”,o). (10) 

The same inequalities hold for , o) . Besides, the values of both mea- 

sures for the initial couple remain the same: 

dhumani^ ) —human^'^ ?^ ) ■ (i^i) 

The underlying phenomenon is a kind of attraction between the objects of a
couple, here o' and o''. Actually, the mutual influence of dissimilarity values
for both objects to a third one tends to minimize their difference. This means
that, within the space induced by the relational measure, both objects will lie
somehow "nearer" than in the space induced by the simple measure, even if the
absolute values of both functions remain equal.

Globally, the attraction results in more compact dissimilarity matrices, in
the sense that the total variance of the values tends to decrease with respect
to a non-relational measure with the same relative weights for property
attributes. Of
[Figure 3 graphic: dendrograms. Leaf orders recovered from the figure:
o1 o4 o5 o6 o2 o3 and o1 o4 o2 o3 o5 o6.]

Fig. 3. Clustering results for the dataset: A. single-linkage clustering with
the simple measure as input; B. single-linkage clustering with $d^o_{human}$ as
input; C. complete-linkage clustering with $d^o_{human}$ as input. The scale
for the first two dendrograms is given on the left and the scale of the last
one on the right of the figure.



course, the above attraction phenomenon has been detected in a very particular 
situation where couples of objects of the same concept are strongly connected. 
However, a similar tendency can be observed in case of strongly connected com- 
ponents of greater size and of heterogeneous composition. 

5 Clustering 

The matrix obtained by the computation of $d^o$ is used as an input for a
hierarchical clustering algorithm which detects homogeneous object groups. An
example of clustering results may be seen in Fig. 3, where input data has been
taken from Table 2.

Thus, Fig. 3A shows the result of a single-linkage clustering on the values of
the simple measure, whereas the dendrograms in Fig. 3B and Fig. 3C are obtained
with $d^o_{human}$ by single-linkage and complete-linkage algorithms
respectively.

A first remark concerns the attraction between related objects, which goes
as far as to change the dissimilarity-induced order between object couples. This
could be observed on the matrix, but the dendrogram shows it even better. In
fact, whereas o4 forms a compact class with o5 in the first case, on the second
dendrogram it is combined with o1. This shift is undoubtedly due to the
influence of o4's spouse, o3, which is nearer to o1's spouse than to o5's. Both
o1 and o4 are attracted to form a class with their spouses, which may be seen
on both dendrograms B and C.
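
The merge order described above can be reproduced with a short single-linkage sketch in plain Python. The values are those printed below the main diagonal of Table 2; the function and variable names are illustrative, and the naive search over all cluster pairs is only meant for a dataset of this size.

```python
from itertools import combinations

labels = ["o1", "o2", "o3", "o4", "o5", "o6"]
# Lower triangle of Table 2 (relational measure), row by row.
lower = [
    [],
    [0.38],
    [0.36, 0.26],
    [0.28, 0.43, 0.31],
    [0.5, 0.61, 0.46, 0.33],
    [0.56, 0.74, 0.57, 0.44, 0.25],
]
dist = {frozenset((labels[i], labels[j])): lower[i][j]
        for i in range(len(labels)) for j in range(i)}


def single_linkage(labels, dist):
    """Agglomerative clustering; cluster distance is the minimum pairwise distance."""
    clusters = [frozenset([label]) for label in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            d = min(dist[frozenset((p, q))] for p in a for q in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(a | b), d))
    return merges


merges = single_linkage(labels, dist)
```

On these values the first three merges pair each object with either its spouse's counterpart or its own spouse ({o5, o6}, {o2, o3}, {o1, o4}), and the two latter groups are joined next, which matches the leaf order of the second dendrogram.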



Fig. 4. Part of the class hierarchy obtained after characterization of the
classes found by the hierarchical clustering.



As far as class inference is concerned, the clusterings with $d^o_{human}$
suggest the existence of four classes below the root class. Before presenting
them for the user's validation, a characterization of each class in terms of
attribute restrictions has to be provided. The description shows the limits of
the region in the concept space represented by the class and thus helps in the
interpretation. Fig. 4 shows the hierarchy made of three classes, corresponding
to the object clusters {o1,o4}, {o2,o3} and {o1,o4,o2,o3}. Observe the
cross-reference between classes human-cl#1 and human-cl#2 established via the
spouse attribute.

6 Conclusion 

An approach towards automatic class design in object languages has been
presented in the paper. Classes are built in two steps: first, member sets of
classes are discovered by a proximity-based clustering procedure; then, each
class is provided with a characterization in terms of attributes.

The dissimilarity measure used for clustering compares objects with respect
to both their properties and their relations. In case of self-references in
object descriptions, the values of the measure are computed as solutions of a
system of linear equations. In addition, the measure allows available taxonomic
knowledge to be used for object comparison, but does not require taxonomies to
exist. In sum, the measure is complete with respect to object descriptions and
therefore allows the detection of clusters which are homogeneous on all object
attributes.
Abstract. An algorithm is proposed which adaptively and simultaneously
estimates and combines classifiers originating from distinct classi- 
fication frameworks for improved prediction. The methodology is devel- 
oped and evaluated on simulations and real data. Analogies and similar- 
ities with generalised additive modeling, neural estimation and boosting 
are discussed. We contrast the approach with existing Bayesian model 
averaging methods. Areas for further research and development are in- 
dicated. 



1 Introduction 

There has been much recent interest in the combination of classifiers for
improved classification. Much of this work applies to the combination of
classifiers which are of a similar type or nature and is thus essentially
concerned with estimation. An early example in the parametric context may be
found in Friedman's regularized discriminant analysis. Buntine, Hastie and
Pregibon, Quinlan, and Oliver and Hand generalize such approaches to the
averaging of classification trees. From a conceptual point of view, these
methods may be formulated either as direct applications of, or as equivalent
to, Bayesian model averaging approaches within a fixed methodological
framework. In some cases, such as parametric discrimination, we can postulate a
'global' probability model on the model probabilities within the framework.
When such probability modeling is not straightforward, a careful consideration
of the sources of variation in calibration may often suggest reasonable
analogue implementations of model averaging to combine distinct
within-framework model estimates.

While these developments to fully account for model uncertainty in a con- 
strained and a priori methodological setting are encouraging, we could argue that 
the evaluation of model uncertainty in applied statistics and data analysis should 
be given a much broader interpretation. Above all, the assessment of model un- 
certainty should be concerned with the choice of the specific methodological 
framework itself within which any subsequent analysis is carried out. Classifica- 
tion studies provide an excellent example of this problem, as most studies are 
faced, at the first instance, with a choice between vastly different
methodological approaches such as linear and quadratic (classical parametric)
discrimination versus machine learning approaches (such as CART), nearest
neighbour methods 
or neural networks, among others. Furthermore, such issues of choice will often 
be much more subtle than those of estimation within any given framework and 
the effects of an unfortunate choice of classification method may be more serious, 
irrespective of the care and attention which could subsequently be applied in the 
calibration of a specific classifier. We could of course argue that model averaging 
methods could be generalised in practical applications to derive ad hoc com- 
binations of predictions from a wider set of models. However, both theoretical 
argument as well as empirical evidence would suggest that we may not be 
able to rely on such implementations to consistently provide good classification 
rules in all applications. As a consequence, and perhaps paradoxically, the Bayes 
paradigm would appear to break down precisely in those situations where model 
uncertainty is most acute. 

This paper develops a methodology which can combine classifications from a 
class of conceptually distinct classification methods. We focus on the combina- 
tion of linear discriminant rules with tree structured classifiers as an example. 
Our methodology derives from an interpretation of and an analogy with neural 
approaches to the combination of statistical models. We develop and illustrate 
the methodology through simulations and application on real data. 

2 Adjusted Estimation 

As a solution within the confines of classical Bayesian probability modeling does 
not readily suggest itself, we may explore related disciplines for inspiration to 
rejuvenate this area of research. A highly interesting class of models may be 
found in the artificial intelligence literature. Several interpretations of
neural network modeling have already been put forward, among which those of
generalised function estimation and prediction, and projection pursuit. An
alternative interpretation from a Bayes perspective would be to view neural
estimation as averaging across the models in the interior layer of the network.
However, notwithstanding the appeal of this analogy, a key difference between 
general model averaging-type methods and neural networks resides in the fact 
that the latter jointly calibrate the models in the hidden layer in some sense such 
that each model takes the best form in the context of the other data summaries 
which are present. In contrast, model averaging simply combines existing mod- 
els. Restricting ourselves to the two-class problem, these considerations suggest 
that we may generalize model averaging to the between-model setting to allow 
calibrations of the type 



$$f(1 \mid x) = a + \sum_{i \in I} g_i(x \mid M_i),$$

where $x$ is the measurement vector, each $g_i(x \mid M_i)$ is a predictor
within the model $M_i$, $\mathcal{M} = \{M_i\}_{i \in I}$ is the set of models
and where the estimation of each predictor $g_i(x \mid M_i)$ is adaptive or
adjusted in the presence of the other models and their respective calibrations.
Of course, for a completely general between-model set $\mathcal{M}$ the usual
procedures for combined estimation and combination used in neural network
estimation will break down. Specifically, for our
model set of linear discrimination and CART, none of the traditional optimiza- 
tion methods seem to apply as there are no simple updating methods for the 
classification tree. Hence, we need to develop a general purpose algorithm which 
can adaptively combine and adjust, not just predicted class indicators or proba- 
bilities, but rather the optimization of the respective constituent models which 
we want to combine. We discuss the solution to this general problem by restrict- 
ing to the case of the prediction of class indicators, and thus regression, first. The 
generalization to the prediction of class probabilities is discussed subsequently. 
We will restrict ourselves to two-class problems. 



2.1 Predicting Class Indicators 

The problem may be simplified by defining $f(1 \mid x) = \delta_{1j}$, with $j$
the class indicator. This reduces the problem to the calibration of a regression
equation for the prediction of class indicators. We can then propose an
optimization criterion such as least squares and consider optimization of

$$\sum_{k=1}^{n} \{\delta_{1j_k} - a - g(x_k \mid \beta, LR) - g(x_k \mid \lambda, Tree)\}^2,$$

where $\delta_{1j_k}$, $k = 1, \ldots, n$, are the class indicators for each of
n observations, $a$ is a constant and $g(x_k \mid \beta, LR)$ and
$g(x_k \mid \lambda, Tree)$ are predictions from a linear regression model and
a regression tree, respectively, for each observation with feature vector
$x_k$. The parameters $\beta$ and $\lambda$ are the regression parameters of
the linear model and a parameter which defines the size of the regression tree
in some sense, respectively. Hence, we have for our choice of models that
$g(x_k \mid \beta, LR) = x_k^T \beta$ and
$g(x_k \mid \lambda, Tree) = \sum_i a_i B_i(x_k)$ is a regression tree, where
the $B_i(x_k) = I[x_k \in R_i]$ are the basis functions defined on the
hyper-rectangles $R_i$ which are derived by the tree fitting algorithm.

The above simplification of the problem suggests optimization of the functions
$g(x_k \mid \beta, LR)$ and $g(x_k \mid \lambda, Tree)$ through an iterative
scheme wherein we alternate between optimizations of LR and CART, each time
keeping the predictions from the other constituent method fixed and then
optimizing for the calibration of the residual which remains. The parameter $a$
is kept fixed at the mean of the class indicator to reduce the redundancy of
the specification. Thus, an alternating least squares scheme emerges which
defines a backfitting procedure in a more statistical manner as compared to the
more usual neural approach. It should be applicable to the combination of a
wide class of models. Calibration of the classification tree at each iteration
may require pruning or some other approach to reduce the variability of the
predictor, which may be achieved through cross-validation or a set-aside test
set. Table 1 shows the structure of the algorithm, which applies to general
(univariate) regression problems.
Table 1. The alternating least squares combination algorithm.

(1.) Initialize. Put iteration counter i = 0.
Put $y_k = \delta_{1j_k}$, $k = 1, \ldots, n$, and $a_0 = mean(y)$, where
$y = (y_1, \ldots, y_n)^T$ and
$g_0(x_k \mid \beta_0, LR) = g_0(x_k \mid \lambda_0, Tree) = 0$,
$k = 1, \ldots, n$. Mean-center both the matrix of predictors
$X = (x_1^T, \ldots, x_n^T)^T$ and predictand $y$ and save the means.

(2.) Update. Put iteration counter i = i + 1.
Fit the model $g_i(x_k \mid \beta_i, LR)$ to the residuals
$y_k - g_{i-1}(x_k \mid \lambda_{i-1}, Tree)$ and grow the tree
$g_i(x_k \mid Tree)$ to the residuals $y_k - g_i(x_k \mid \beta_i, LR)$. Derive
the optimal constraining parameter $\lambda_i$ for the unpruned tree
$g_i(x_k \mid Tree)$ (see later) and stabilize the tree size
$\lambda_i \leftarrow f(\lambda_1, \ldots, \lambda_i)$. Apply to derive the
pruned tree $g_i(x_k \mid \lambda_i, Tree)$.

(3.) Verify. Check convergence of the relative change of
$L_i = \sum_k (y_k - g_i(x_k \mid \beta_i, LR) - g_i(x_k \mid \lambda_i, Tree))^2$
and repeat step 2 if necessary.
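
The alternation in Table 1 can be sketched in plain Python under strong simplifications, all of which are assumptions made for illustration: a single predictor, a two-leaf regression stump standing in for the CART tree, and no pruning or stabilization of the tree size. All function names are hypothetical.

```python
def fit_line(x, r):
    """Ordinary least squares fit of r on a single predictor x (with intercept)."""
    n = len(x)
    mx, mr = sum(x) / n, sum(r) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (ri - mr) for xi, ri in zip(x, r)) / sxx
    return lambda t: mr + slope * (t - mx)


def fit_stump(x, r):
    """Best single-split regression tree (two leaves) under squared error."""
    best = None
    for s in sorted(set(x))[1:]:
        left = [ri for xi, ri in zip(x, r) if xi < s]
        right = [ri for xi, ri in zip(x, r) if xi >= s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - ml) ** 2 for ri in left)
               + sum((ri - mr) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, s, ml, mr)
    _, s, ml, mr = best
    return lambda t: ml if t < s else mr


def alternating_least_squares(x, y, max_iter=50, tol=1e-8):
    """Backfit a line and a stump in turn; return the predictor and loss history."""
    a = sum(y) / len(y)                      # constant kept fixed at mean(y)
    yc = [yi - a for yi in y]                # mean-centered predictand
    line = lambda t: 0.0                     # step 1: both constituents start at 0
    tree = lambda t: 0.0
    losses = []
    for _ in range(max_iter):
        # Step 2: refit each constituent to the residual left by the other.
        line = fit_line(x, [yi - tree(xi) for xi, yi in zip(x, yc)])
        tree = fit_stump(x, [yi - line(xi) for xi, yi in zip(x, yc)])
        loss = sum((yi - line(xi) - tree(xi)) ** 2 for xi, yi in zip(x, yc))
        # Step 3: stop when the relative change of the criterion is small.
        converged = bool(losses) and abs(losses[-1] - loss) <= tol * max(losses[-1], 1e-12)
        losses.append(loss)
        if converged:
            break
    return (lambda t: a + line(t) + tree(t)), losses


# Toy data: a linear trend plus a jump at x = 10.
xs = [float(i) for i in range(20)]
ys = [0.5 * xi + (3.0 if xi >= 10 else 0.0) for xi in xs]
predict, losses = alternating_least_squares(xs, ys)
line_only = fit_line(xs, ys)
line_only_loss = sum((yi - line_only(xi)) ** 2 for xi, yi in zip(xs, ys))
```

Since each half-step is a least squares fit over a family that contains the previous fit, the criterion cannot increase from one iteration to the next in this simplified setting; with pruning driven by cross-validation, as in the algorithm of Table 1, that guarantee no longer holds automatically.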



Simulations. We evaluate simulations as a first step in developing and
validating the proposed methodology. Four simulated datasets were generated in
two dimensions, each of which consists of 225 samples. For each of these
simulations, 100 observations were generated from each of two bivariate normal
distributions with means (0,1) and (0,-1) respectively and common covariance
matrix diag(4, 0.25). We then simulated 25 observations from a contaminant
spherical normal distribution with locations (4,2), (2,3), (0,-2) and (0,-1)
for the first, second, third and fourth simulations respectively and a
covariance matrix of diag(0.05, 0.05), and all the data was rotated 45 degrees
counter-clockwise. Figure 1 shows pictures of each of the resulting
simulations. Validated classifications were derived for each of the four
simulation models which were considered. The validation samples were of the
same size as those used for calibration. Although simulation could also have
been used to prune the regression trees, we chose to use leave-one-out
cross-validation to make the simulations more realistic.
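
The data-generating scheme described above can be sketched with the Python standard library. The sketch makes two assumptions that the text does not settle: the contaminant points are given their own label rather than being assigned to one of the two classes, and the random seed and function names are arbitrary.

```python
import math
import random

CONTAMINANT = 2  # assumption: contaminants get a separate label


def rotate45(x, y):
    """Rotate a point 45 degrees counter-clockwise about the origin."""
    c, s = math.cos(math.pi / 4), math.sin(math.pi / 4)
    return (c * x - s * y, s * x + c * y)


def simulate(contaminant_location, seed=0):
    """Generate one simulated dataset of 225 labeled points."""
    rng = random.Random(seed)
    points = []
    # Two classes of 100 points, covariance diag(4, 0.25): std devs 2 and 0.5.
    for label, (mx, my) in ((0, (0.0, 1.0)), (1, (0.0, -1.0))):
        for _ in range(100):
            points.append((rng.gauss(mx, 2.0), rng.gauss(my, 0.5), label))
    # 25 contaminants, covariance diag(0.05, 0.05).
    cx, cy = contaminant_location
    sd = math.sqrt(0.05)
    for _ in range(25):
        points.append((rng.gauss(cx, sd), rng.gauss(cy, sd), CONTAMINANT))
    return [rotate45(x, y) + (label,) for x, y, label in points]


first_simulation = simulate((4.0, 2.0))
```

Calling simulate with the locations (2,3), (0,-2) and (0,-1) would give datasets in the spirit of the second, third and fourth simulations.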

Table 2 displays the validated classifications when shrinkage-based pruning
is used for constraining the fitting of each tree, based on cross-validated
least squares lack-of-fit as a measure of the deviance of each tree. The
shrinkage parameter was found to be highly variable during iterations and hence
some constraint has to be placed on the choice of the pruning parameter
$\lambda$. A criterion $f(\lambda_1, \ldots, \lambda_i)$ which depends on the
pruning parameters calibrated in previous iterations and the newly proposed
pruning parameter $\lambda_i$ at the current iteration i, and which becomes
gradually more resistant to changes in the pruning parameter, seems
appropriate. A simple rule based on a simple averaging of at most the past ten
pruning parameters, possibly augmented by a form of robust averaging in later
iterations, was found to work well. Such a procedure is analogous to
backstopping in more conventional Gauss-Newton approaches. The table shows the
proportions of misclassified observations for the combination, ordinary linear
discrimination and those from a regression tree fit to the class indicators.
The parameters $\beta = (\beta_1, \beta_2)$ of the final linear discriminant
predictor $g(x_k \mid \beta, LR)$ within the optimized model combination are
shown in the table together with their Euclidean norm and similarly, for the
linear discriminant model only. An estimate of the size $\lambda$ of the final
regression tree $g(x_k \mid \lambda, Tree)$ of the model combination is also
given.
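
The simple averaging rule for the pruning parameter can be written as a one-line moving average. The sketch below covers only the plain average over at most the past ten values; the optional robust averaging mentioned above is left out, and the function name and the default window argument are illustrative.

```python
def stabilize(history, window=10):
    """Average of at most the last `window` pruning parameters.

    `history` holds the parameters proposed so far, the newest one last.
    """
    recent = history[-window:]
    return sum(recent) / len(recent)


# Hypothetical sequence of proposed pruning parameters.
proposed = [2.0, 4.0, 9.0, 3.0]
stabilized = [stabilize(proposed[: i + 1]) for i in range(len(proposed))]
```

As more iterations accumulate, a single newly proposed value moves the average less, which is the growing resistance to change asked of the criterion.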
Fig. 1. Pictures of the simulated calibration data of two-class classification
problems for each of the four simulations.



As may be seen from the table, the combination approach does extremely
well for all simulations except for the second. The interpretation of these
results is that the classification tree is able to remove the contaminating
samples from the calibration of the linear discriminant model, as may be seen
from the linear discriminant coefficients which are calibrated, particularly
for the first simulation as one would expect. The second simulation is an
exception, due to the failure of cross-validation-based shrinkage in the
alternating least squares algorithm to sufficiently prune the tree.
Recalibration of the model combination using newly simulated data from the
appropriate model to select the pruning parameter gave results similar to those
for the other simulations. Generally and across all simulations, the tree sizes
tend to be too large, which may again be explained by the use of
cross-validatory estimation (see the discussion on convergence further on).



Data. The above simulations have shown that the combination method can
improve classifications across the constituents, and also allows for a proper
and correct identification of the relevant model constituents which are
present. Hence,
Table 2. Validated classification results from the combination of models through
alternating least squares as well as for linear discrimination and CART
separately, using cross-validation based shrinkage for the calibration of the
regression trees. * Failed to converge at iteration 100, relative change of the
least squares criterion 9.1e-4.

           | Classification Method | Combination Parameters       | LR Parameters
Simulation | COMB   Tree   LR      | β1     β2      ||β||  λ      | β1     β2      ||β||
-----------+-----------------------+------------------------------+---------------------
First      | 0.040  0.173  0.213   | 0.258  -0.262  0.368  2      | 0.131  -0.257  0.289
Second*    | 0.088  0.098  0.027   | 0.146  -0.125  0.192  6.6    | 0.220  -0.197  0.295
Third      | 0.058  0.129  0.138   | 0.277  -0.306  0.413  3.2    | 0.151  -0.172  0.229
Fourth     | 0.084  0.116  0.164   | 0.306  -0.277  0.413  3.9    | 0.247  -0.221  0.331



we investigated the performance of the approach on real data, all of which derive
from the Statlog project. Seven data sets were investigated; for each, a calibration
set was defined and a test set put aside. Each of these either is a two-class
problem (Pima Indians Diabetes, Heart Disease, Australian Credit) or was
reduced to one (Handwritten Digits, Letter Image Recognition).
The analysis and experiment used the continuous variables only:
variables 2 to 8 for the diabetes data,
variables 1, 4, 5, 8 and 10 for the heart data, and variables 13 and 14 for the
Australian credit data. For the letter image data as well as the digits data, all
sixteen attributes were used in the construction of the classification rules. For the
digit data, only discrimination between digits 1 and 7 was investigated;
similarly, for the letter image data, the two-class problems of distinguishing between the
letters o and q, i and j, and b and e were investigated.
All other classes were removed from the data in the investigation of each of these
problems.

The results presented in table 3 (misclassified observations with misclassification
proportions in brackets) are broadly favourable to the combination method,
with the methodology either at least equivalent in terms of misclassification rate
to the optimal single model or predictor (Heart, Credit, Letters (be)) or improving
on the constituents (Diabetes, Letters (oq)). The Digits (17) data and
the Letters (ij) data are the exceptions. For the digits this appears to be due to insufficient
shrinkage, which eliminates the linear discriminant model from the combination
and seems to confirm the potential cross-validation-based
shrinkage problems seen in the simulation experiments. On the other hand, the
methodology appears capable of identifying a single optimal model if one exists
(Heart data: $\lambda = 1$).



Convergence We have used and motivated alternating least squares as an ad 
hoc method for the combination of distinct models. Breiman and Friedman |2j. 
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Table 3. Validated classification results on real data and for the comparison of 
the alternating least squares combination method, linear discrimination and CART 
separately. Both the number of misclassified observations and the misclassification rate 
are given. 



               Classification Method                      Model Parameters
Data set       Combination   Tree          LR             $\|\beta\|$   $\lambda$
Diabetes       39 (0.232)    43 (0.260)    44 (0.262)     0.118         1.4
Heart          32 (0.320)    40 (0.400)    32 (0.320)     0.147         1
Credit         92 (0.317)    91 (0.314)    118 (0.407)    0.048         3.6
Digits (17)    25 (0.014)    27 (0.015)    19 (0.011)     2.22e-3       5.9
Letters (oq)   9 (0.025)     19 (0.052)    18 (0.049)     0.213         17.3
Letters (ij)   28 (0.078)    23 (0.064)    33 (0.092)     0.132         9.4
Letters (be)   4 (0.010)     5 (0.012)     21 (0.052)     0.029         7.3

Buja, Hastie and Tibshirani |S|, Buja ^ and Hastie and Tibshirani H3 investi- 
gate the existence of optima and convergence for alternating least squares and 
alternating conditional expectation algorithms and for the case of estimating 
optimal transformations. While our application is different, we have an addi- 
tional problem in the use of cross-validation which is intrinsic in the fitting of 
the regression tree. This is analogous to difficulties in using alternating least 
squares with the supersmoother for estimating optimal transforms of data. As a 
consequence, the algorithm is no longer a strict alternation between pure least 
squares fits. The simulations confirm this problem (second simulation) and ex- 
amples can be found where the sequence of models fails to settle, due to the 
inability of cross-validation to identify a realistic pruning parameter. In all cases
we have seen, however, convergence problems have been attributable to cross-validation
and disappear either when a user-specified pruning parameter is
supplied and kept fixed throughout the alternations or when set-aside test sets are used.
A more subtle difficulty may reside in the use of a regression tree which may not 
have the required properties as a ‘data smooth’ |5|. As discussed above, we have 
not found evidence of this in practical applications, which mirrors experience by 
Breiman and Friedman. 
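
The role of a fixed pruning parameter in the convergence argument above can be illustrated with a minimal sketch. It is not the implementation used in this paper: it assumes a single predictor, an ordinary least squares line in place of the linear discriminant model, and a single-split stump of fixed size standing in for a regression tree with a user-specified pruning parameter. The function names (`fit_line`, `fit_stump`, `alternate`) and the toy data are illustrative only. Because both steps are then pure least squares fits, the criterion cannot increase from one alternation to the next.

```python
def fit_line(x, r):
    """Ordinary least squares fit of r on x; returns the fitted values."""
    n = len(x)
    mean_x, mean_r = sum(x) / n, sum(r) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (ri - mean_r) for xi, ri in zip(x, r)) / sxx
    return [mean_r + slope * (xi - mean_x) for xi in x]

def fit_stump(x, r):
    """Best single-split piecewise-constant fit of r on x (tree of fixed size)."""
    best_sse, best_fit = None, None
    for t in sorted(set(x))[1:]:          # split: x < t versus x >= t
        left = [ri for xi, ri in zip(x, r) if xi < t]
        right = [ri for xi, ri in zip(x, r) if xi >= t]
        m_left, m_right = sum(left) / len(left), sum(right) / len(right)
        fit = [m_left if xi < t else m_right for xi in x]
        sse = sum((ri - fi) ** 2 for ri, fi in zip(r, fit))
        if best_sse is None or sse < best_sse:
            best_sse, best_fit = sse, fit
    return best_fit

def alternate(x, y, max_iter=100, tol=1e-6):
    """Alternate the two least squares fits until the criterion settles."""
    g_lin = [0.0] * len(x)
    g_tree = [0.0] * len(x)
    history = []                          # least squares criterion per alternation
    for _ in range(max_iter):
        g_lin = fit_line(x, [yi - ti for yi, ti in zip(y, g_tree)])
        g_tree = fit_stump(x, [yi - li for yi, li in zip(y, g_lin)])
        sse = sum((yi - li - ti) ** 2 for yi, li, ti in zip(y, g_lin, g_tree))
        converged = bool(history) and abs(history[-1] - sse) <= tol * max(history[-1], 1e-12)
        history.append(sse)
        if converged:
            break
    return g_lin, g_tree, history

# Toy data (an assumption for illustration): a linear trend plus a jump at x = 12.
x = [float(i) for i in range(20)]
y = [0.1 * xi + (2.0 if xi >= 12 else 0.0) for xi in x]
g_lin, g_tree, history = alternate(x, y)
print(len(history), history[0], history[-1])
```

If the stump size were instead re-selected by cross-validation at every alternation, the second step would no longer be a pure least squares fit over a fixed model class, and the monotone decrease shown by `history` would no longer be guaranteed.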

Of potentially greater importance is the question whether and how the me- 
thodology can identify good and realistic model combinations in practice and 
from the point of view of classification. Figure 2 shows pictures of the first simulation
with the separating surfaces from linear discrimination and the regression
tree superimposed, as well as those from the same models within the alternating 
least squares model combination. It is not surprising that linear discrimination 
should have a problem with this simulation, as the estimation of the within-group 
covariance matrices for the model corresponding to the first 200 observations will 
be biased due to the presence of the contaminant model. As a consequence, the 
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angle of the linear discriminant surface relative to the first axis is 27° (as op- 
posed to 45°). All observations from the contaminant model are misclassified, 
in addition to misclassifications in the first 200 observations due to the inappro- 
priate orientation of the separating surface relative to these data. Likewise, the 
regression tree cannot approximate the first diagonal well because of the rectangular
shape of the generated basis regions, which is inherent in the specification
of the method, and this problem is exacerbated by the small sample sizes. More
surprisingly, however, it also fails to fully isolate the relatively compact cluster
of contaminant data. As a consequence, neither model can cope with the
data as given here. In contrast, the alternating least squares model combination 
separates the data by reducing the regression tree to a single split on the first 
axis only (coordinate value 3.2). This effectively removes the contaminant data 
from the calibration of the linear discriminant model and as a consequence, the 
angle of the calibrated linear discriminant surfaces returns to 45°. The model 
combination thus identifies what is effectively an optimally separating model. 
Similar conclusions and discussion apply for the other simulations. 
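
As a rough check on the angles quoted above, the orientation of a linear discriminant boundary can be recovered from its two coefficients. The sketch below assumes the boundary is the line $\beta_1 x_1 + \beta_2 x_2 = c$, so that its direction vector is $(-\beta_2, \beta_1)$, and uses the first-simulation coefficients reported in Table 2. The function name is illustrative only.

```python
import math

def boundary_angle_deg(beta1, beta2):
    """Angle in degrees between the line beta1*x1 + beta2*x2 = c and the first axis."""
    # The line is perpendicular to (beta1, beta2), so it runs along (-beta2, beta1).
    # The result is folded into [0, 180) because a line has no orientation.
    return math.degrees(math.atan2(beta1, -beta2)) % 180.0

# First simulation, Table 2: linear discrimination alone and within the combination.
lda_only = boundary_angle_deg(0.131, -0.257)
combined = boundary_angle_deg(0.258, -0.262)
print(f"LDA only: {lda_only:.1f} degrees, within combination: {combined:.1f} degrees")
```

Under this reading of the coefficients, the two values come out close to the 27° and 45° discussed in the text.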



[Figure 2: three panels titled LDA, Tree and Combination]






Fig. 2. Pictures of the simulated calibration data and discriminant models for 
the first simulation. The pictures on the left are for linear discrimination and the 
regression tree only. The right figure shows the separating surfaces from the same 
two models within the model combination derived by alternating least squares. 



2.2 Predicting Probabilities 

We may generalize the above approach to the prediction of the posterior probabilities
of class membership by postulating the generalised dependent variable
$f(1 \mid x) = \operatorname{logit}(p(1 \mid x))$. The alternating least squares method may then be
applied to the linearisation of the corresponding prediction formula

$$\operatorname{logit}(p(1 \mid x)) = \alpha + \sum_{i \in I} g(x \mid M_i)$$
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by deriving the adjusted dependent variable 

$$z = \alpha + g(x \mid \beta, \mathrm{LR}) + g(x \mid \lambda, \mathrm{Tree}) + \frac{y - p}{p(1 - p)}$$

where $p = p(\alpha + g(x \mid \beta, \mathrm{LR}) + g(x \mid \lambda, \mathrm{Tree}))$ with $p(x) = \exp(x)/(1 + \exp(x))$
(note that $p$ is short notation for $p(1 \mid x)$). We then fit the combination of classification
models to the adjusted dependent variable through the weighted
alternating least squares method described above, where the weights are given as $w = p(1 - p)$.
The fitted combined models define the update of the adjusted dependent
variable and the weights for the next iteration, and the procedure is repeated
until convergence. This approach, which is essentially an application of the local
scoring algorithm, thus embeds the previous iterative alternating least squares
method for the calibration of class indicators in a second iterative layer for
the optimization of the log-likelihood

$$L = \sum \left[ y \ln(p) + (1 - y) \ln(1 - p) \right].$$
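
A single outer update of this scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: `eta` stands for the current additive predictor (the intercept plus the linear discriminant and tree contributions, one value per observation), the adjusted dependent variable is taken in the usual local scoring form, namely the predictor plus $(y - p)/(p(1 - p))$, and the function name is hypothetical.

```python
import math

def local_scoring_step(eta, y):
    """Return the adjusted dependent variable, the weights and the log-likelihood.

    eta : current additive predictor, one value per observation
    y   : class indicators coded 0/1
    """
    p = [1.0 / (1.0 + math.exp(-e)) for e in eta]         # inverse logit
    w = [pi * (1.0 - pi) for pi in p]                     # weights w = p(1 - p)
    z = [e + (yi - pi) / wi for e, yi, pi, wi in zip(eta, y, p, w)]
    loglik = sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
                 for yi, pi in zip(y, p))
    return z, w, loglik

z, w, loglik = local_scoring_step(eta=[0.0, 1.0, -2.0], y=[1, 1, 0])
print(z, w, loglik)
```

The returned `z` and `w` would then be passed to the weighted alternating least squares fit, whose fitted predictor supplies the `eta` for the next outer iteration.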

The methodology proposed above was found to work on the strict condition
that a backstepping procedure is enforced at each stage to ensure that the internal
alternating least squares algorithm identifies model combinations which
improve the log-likelihood. This may be done for the parameter $\alpha$ and the
parameter vector $\beta$ from the linear discriminant model by postulating mixing
parameters $0 < \gamma_\alpha, \gamma_L < 1$ and then evaluating the log-likelihood for the
backstep estimates

$$\alpha_i \leftarrow \gamma_\alpha \alpha_{i-1} + (1 - \gamma_\alpha) \alpha_i$$

$$g_i(x_k \mid \beta_i, \mathrm{LR}) \leftarrow \gamma_L \, g_{i-1}(x_k \mid \beta_{i-1}, \mathrm{LR}) + (1 - \gamma_L) \, g_i(x_k \mid \beta_i, \mathrm{LR})$$

where the subscripts $i$ and $i - 1$ identify the estimates with respect to the
corresponding models after and before the present iteration $i$, respectively.
Devising a similar backstepping procedure for the regression tree is more
complicated, as both the redefinition of the spatial structure (basis functions)
at each step and the use of pruning or shrinkage preclude a direct application
of the above backstepping method. We may generalize the approach to trees in
the following manner. Using similar notation as before, we let
$g_i(x_k \mid \mathrm{Tree})$ and $g_{i-1}(x_k \mid \mathrm{Tree})$ represent the unpruned trees from the
present and previous iterations, with proposed corresponding optimal pruning
parameters $\lambda_i$ and $\lambda_{i-1}$, as derived according to the algorithms described
in the previous section. We may backstep with a two-stage approach by first
backstepping the shrinkage parameter only,

$$\lambda_i \leftarrow \gamma_T \lambda_{i-1} + (1 - \gamma_T) \lambda_i$$

for some choice of $0 < \gamma_T < 1$, and then applying the backstep estimate $\lambda_i$
to derive the backstep regression tree predictions

$$g_i(x_k \mid \lambda_i, \mathrm{Tree}) \leftarrow \gamma_T \, g_{i-1}(x_k \mid \lambda_i, \mathrm{Tree}) + (1 - \gamma_T) \, g_i(x_k \mid \lambda_i, \mathrm{Tree})$$

which may then be used to define the updated adjusted dependent variable $z$
for the next iteration. The optimal combination of backstepping parameters
$0 < \gamma_\alpha, \gamma_L, \gamma_T < 1$ may be identified by a three-dimensional grid search
on the unit cube, whereby the log-likelihood is evaluated for each choice and
the optimum identified. The generalisation to the multiclass case is, from the
formulaic point of view, identical to that of generalising logistic discrimination.
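
The grid search over the backstepping parameters can be sketched as follows. This is a simplified illustration rather than the procedure as implemented for the experiments. The callable `tree_backstep` is a hypothetical stand-in for the two-stage tree backstep (re-pruning both trees at the backstepped pruning parameter and then mixing their predictions); in the toy example it only mixes two fixed prediction vectors. The grid resolution and the inclusion of the end points 0 and 1 are further assumptions; including 1 means that the previous iterate is always among the candidates.

```python
import itertools
import math

def softplus(v):
    """Numerically stable log(1 + exp(v))."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def log_likelihood(eta, y):
    """Bernoulli log-likelihood of 0/1 labels y under logit-scale predictor eta."""
    return sum(-yi * softplus(-e) - (1 - yi) * softplus(e) for e, yi in zip(eta, y))

def mix(gamma, prev, cur):
    """Backstep estimate: gamma * previous + (1 - gamma) * current, element-wise."""
    return [gamma * a + (1.0 - gamma) * b for a, b in zip(prev, cur)]

def backstep(y, alpha_prev, alpha_cur, lr_prev, lr_cur, tree_backstep, steps=5):
    """Grid search over (gamma_alpha, gamma_L, gamma_T) maximising the log-likelihood."""
    grid = [i / (steps - 1) for i in range(steps)]
    best = None
    for g_a, g_l, g_t in itertools.product(grid, repeat=3):
        alpha = g_a * alpha_prev + (1.0 - g_a) * alpha_cur
        lr = mix(g_l, lr_prev, lr_cur)
        tree = tree_backstep(g_t)        # hypothetical: predictions at backstepped lambda
        eta = [alpha + a + b for a, b in zip(lr, tree)]
        ll = log_likelihood(eta, y)
        if best is None or ll > best[0]:
            best = (ll, (g_a, g_l, g_t))
    return best

# Toy values (assumptions for illustration only).
y = [1, 0, 1, 0]
alpha_prev, alpha_cur = 0.0, 0.5
lr_prev, lr_cur = [0.5, -0.5, 0.2, -0.2], [3.0, 3.0, -3.0, -3.0]
tree_prev, tree_cur = [0.0, 0.0, 0.0, 0.0], [1.0, -1.0, 1.0, -1.0]
result = backstep(y, alpha_prev, alpha_cur, lr_prev, lr_cur,
                  tree_backstep=lambda g: mix(g, tree_prev, tree_cur))
print(result)
```

With the end points included, the selected combination can never have a lower log-likelihood than either the previous iterate or the unmodified new fit, which is the safeguard the backstep is meant to provide.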
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Table 4. Validated classification results from the combination of models through 
adjusted estimation as well as for linear discrimination and CART separately, using 
cross-validation based shrinkage for the calibration of the regression trees. Parameters 
are given for the combination. 



                    Classification Method                     Model Parameters
Simulation          Combination   Tree          LR            $\alpha$   $\beta_1$   $\beta_2$   $\|\beta\|$   $\lambda$
First simulation    8 (0.036)     32 (0.142)    48 (0.213)    -0.349     2.911       -3.030      4.202         2.2
Second simulation   7 (0.031)     21 (0.093)    6 (0.027)     -1.367     4.722       -3.839      6.086         7.4
Third simulation    17 (0.076)    30 (0.133)    31 (0.138)    -0.248     2.562       -2.845      3.828         4.2
Fourth simulation   19 (0.084)    27 (0.120)    37 (0.164)    -0.174     4.184       -4.122      5.873         2.8



Simulations Results from the application of this methodology to the simulations
are shown in table 4. These are broadly in line with those from the
prediction of class indicators (table 2), with the exception of the second simulation,
where the adjusted estimation algorithm is clearly superior. This is due
to the application of backstepping in the identification of the optimum, which
does not feature in the basic alternating least squares algorithm. Problems with
cross-validation and the choice of the optimum tree size apply to both algorithms,
and to this simulation specifically. Readers may note small differences between
the results on CART and those shown in table 2. This is because
table 4 shows results from the fitting of a classification tree, which optimizes
the log-likelihood and predicts probabilities, whereas the previous results were
obtained from a regression tree fitted to the class indicators using the
sum of squares of deviations as a measure of homogeneity. The same problem
does not apply to the linear model, as the discriminant surface is identical in
both cases. With respect to the first simulation, the fitted combined model is
effectively the same (in terms of the separating surfaces) as that discussed in
section 2.1.3, with a similar interpretation.



Data Table 5 shows results from the adjusted estimation algorithm for the
same data as discussed before. Results are again comparable to those obtained before, with the
exception of the Digits (17) and the Letters (ij) data, for which the adjusted algorithm
is clearly superior. This again demonstrates the importance of the backstep. Most
importantly, the combination either improves on the constituents or equals the
best method, across all examples.



3 Interpretation and Discussion 

We have developed and explored an ad hoc algorithm for the combination of 
predictors from models which originate from conceptually different classifica- 
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Table 5. Validated classification results on data and for the comparison of the ad- 
justed estimation combination method, linear discrimination and CART separately. 
Both the number of misclassified observations and the misclassification rate are given. 



               Classification Method                      Model Parameters
Data set       Combination   Tree          LR             $\alpha$   $\|\beta\|$   $\lambda$
Diabetes       39 (0.232)    43 (0.260)    44 (0.262)     -0.644     0.720         1.4
Heart          33 (0.330)    37 (0.370)    32 (0.320)     -0.335     0.870         1
Credit         97 (0.334)    94 (0.324)    118 (0.407)    -0.028     1.37e-3       3.3
Digits (17)    11 (0.006)    19 (0.011)    19 (0.011)     0.232      0.0783        4.9
Letters (oq)   11 (0.030)    23 (0.063)    18 (0.049)     0.247      4.970         16.9
Letters (ij)   26 (0.072)    27 (0.075)    33 (0.092)     0.082      2.276         17.1
Letters (be)   8 (0.020)     9 (0.023)     21 (0.052)     -0.053     2.839         8.2

tion frameworks. It is important to note that the methodology is of a cru- 
cially different nature from methods based on the combination of committees 
or ensembles of predictors, such as may be found in bagging | 2 | or stacking 
m approaches or more general model averaging and Bayesian procedures [TTlj . 
Model averaging effectively operates on models which have already been defined 
and combines them by removing variance through a smoothing of overcom- 
plex models to avoid overfitting (e.g. regularized discrimination). In contrast 
to these methods, alternating least squares and adjusted estimation simultane- 
ously adapt the estimation of the constituent models and combine in a manner 
which is similar to the estimation methods employed in generalized additive 
modeling m- The difference from generalised additive modeling estimation is 
that the additivity constraint is sacrificed to estimate a more general predictor 

$f(1 \mid x) = \alpha + g_1(x) + \cdots + g_I(x)$ as opposed to $f(1 \mid x) = \alpha + g_1(x_{J_1}) + \cdots + g_I(x_{J_I})$,

where the set $\{x_{J_i} : i = 1, \ldots, I\}$ represents a partition of $x$ such that
$J_1 \cup \ldots \cup J_I = \{1, \ldots, p\}$, with $p$ the number of predictors. The latter formulation
points to the possibility of deriving model combinations which are intermediate
between generalized additive modeling and the combinations derived in
this paper, by relaxing the partitioning assumption.



3.1 Boosting and Adjusted Estimation 

Hastie and Tibshirani discuss the estimation of predictors which are additive on 
partitions (pages 271-274) and with specific reference to combination of models 
with regression trees for the modeling of interactions. In addition to the above 
remarks on additivity however, the analysis described in those pages fits the re- 
gression tree last and on the residuals from the previous model fits with no fur- 
ther iteration. The addition of the regression tree thus can have no effect on the 
fit of the previously added models and most importantly on the definition of the 






Bart J.A. Mertens and David J. Hand 



selected basis functions. In contrast, our procedure may be viewed as a method 
which ‘boosts’ the performance of each constituent model by implementing a 
fully alternating estimation. Indeed, as suggested by the simulations, it appears 
that the method effectively operates by removing influential observations which 
are inconsistent with a specific constituent model or method from the calibration 
of that method and then assigns them to the calibration of another classifier. 
Thus, while traditional Bayesian methods distribute models across data, our 
method distributes each datum to the most appropriate classifier. Another way 
to formulate this phenomenon is to say that the algorithm effectively identifies 
clusters and then assigns these clusters to an appropriate classifier or model. 
This behaviour of alternating least squares methods has been noted before by 
Buja P] who complains about the somewhat anecdotal nature of the evidence 
with respect to clustering, a problem which our paper unfortunately does not 
address. Finally, we should note that new work has recently emerged on the 
analogy between boosting algorithms 0I2IJ and the generalised additive esti- 
mation methods while the research presented in this paper was carried out. H2! 
describes how the boosting algorithms of Freund and Schapire are of the same 
form as those used for the optimization of the likelihood function for generalised 
additive estimation. Perhaps not surprisingly, our algorithm is again of that 
class, but with the difference that we only keep the final model combination 
from the sequence of models identified by the alternating algorithm and that 
we optimize across distinct frameworks of models whereas Hastie, Freund and 
Schapire apply boosting to the estimation of a single model only. Our remarks 
may enhance this discussion by pointing to potential links between boosting, in- 
fluence analysis and clustering, which would provide a more elaborate statistical 
interpretation of the methodology. Clearly however, further research is required 
to evaluate such interpretation which the authors may pursue. 

3.2 Neural Networks, Modern Classification Methods, and 
Adjusted Estimation 

We have already amply discussed the adaptive nature of the strategies deployed 
in this paper and their (at least) conceptual similarities to the calibration of mod- 
els in nodes within neural network prediction structures. From the more classical 
perspective of generalised prediction equations, and keeping all remarks about 
adaptive estimation in mind, we should note that the form of the predictors of the 
combined models discussed in this paper, as well as those from model averaging 
and more general combination methods, may be viewed as special cases of the 
general predictor $f(1 \mid x) = h(g_1(x), \ldots, g_I(x))$. This may be constrained to the
case of purely linear combinations of predictors $f(1 \mid x) = \alpha + \sum_i g_i(x)$, which
is structurally of the form of the predictors which are constructed in model
averaging, generalized discriminant functions 0 and adjusted estimation. We
could reduce this further to derive the sequence of subsequent specialisations
$f(1 \mid x) = \alpha + \sum_i g_i(\alpha_i + x^T \beta_i)$, as in projection pursuit or neural networks,
$f(1 \mid x) = \alpha + \sum_i g_i(x_i)$, as in generalised additive modeling, and eventually linear
prediction $f(1 \mid x) = \alpha + x^T \beta$. In terms of the structure of the predictor only,
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Fig. 3. Boxplots of validated misclassification rates for repetitions of the previ- 
ously discussed four simulation experiments (first to fourth: from left to right). 
Misclassification rates are shown for adjusted estimation (C), Tree only (T), 
LDA only (L) and a neural network (N). 



our models may thus be viewed as a generalization of either projection pursuit or 
neural networks and generalised additive modeling, at least because we allow the 
interior ‘nodes’ to be, in principle, any general model. While the methodology 
may thus be able to provide adequate predictors for a wider class of classifi- 
cation problems, this raises the problem of potential overfitting of the data. 
Figure 3 shows boxplots of validated misclassification rates for 10 repetitions of
the previously discussed four simulation experiments, each time simulating new 
calibration and validation data. For each simulation, the combination of linear 
discrimination and tree was fitted and evaluated and similarly for linear discrim- 
ination and the classification tree only. It is clear that the combination method 
compares very favourably. Comparisons with a feed-forward neural network with 
two hidden logistic nodes as well as a logistic output node and maximum like- 
lihood fitting are also included for the same simulations (courtesy of the nnet 
procedure 1221 ). With the exception of the second simulation, for which the neural
network reproduces the linear discrimination results, the neural network is not competitive
for the other problems. Crucially, there is no evidence of overfitting with respect
to adjusted estimation. The reasons for this are as yet not clear but may perhaps 
be found in analogies with penalized least-squares fitting, as is described for the 
more conventional applications of generalised additive estimation by Hastie and 
Tibshirani (page 110). Again it is clear that further work is needed to evaluate 
this. 
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Abstract. Knowledge based artificial neural networks offer an attrac- 
tive approach to extending or modifying incomplete knowledge bases or 
domain theories through a process of data-driven theory refinement. We 
present an efficient algorithm for data-driven knowledge discovery and 
theory refinement using DistAI, a novel (inter-pattern distance based,
polynomial time) constructive neural network learning algorithm. The
initial domain theory, consisting of propositional rules, is translated into
a knowledge based network. The domain theory is modified using DistAI 
which adds new neurons to the existing network as needed to reduce clas- 
sification errors associated with the incomplete domain theory on labeled 
training examples. The proposed algorithm is capable of handling patterns
represented using binary, nominal, as well as numeric (real-valued)
attributes. Results of experiments on several datasets for financial ad- 
visor and the human genome project indicate that the performance of 
the proposed algorithm compares quite favorably with other algorithms 
for connectionist theory refinement (including those that require sub- 
stantially more computational resources) both in terms of generalization 
accuracy and network size. 



1 Introduction 

Inductive learning systems attempt to learn a concept description from a se- 
quence of labeled examples |1 3|1 7f‘21 j . Artificial neural networks, because of their 
massive parallelism and potential for fault and noise tolerance, offer an attractive 
approach to inductive learning !iniiifin) . Such systems have been successfully 
used for data-driven knowledge acquisition in several application domains. How- 
ever, these systems generalize from the labeled examples alone. The availability 
of domain specific knowledge (domain theories) about the concept being learned 
can potentially enhance the performance of the inductive learning system m- 
Hybrid learning systems that effectively combine domain knowledge with the 
inductive learning can potentially learn faster and generalize better than those 
based on purely inductive learning (learning from labeled examples alone). In 
practice the domain theory is often incomplete or even inaccurate. 

D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 331-|212I 1999- 
© Springer-Verlag Berlin Heidelberg 1999
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Inductive learning systems that use information from training examples to 
modify an existing domain theory by either augmenting it with new knowledge 
or by refining the existing knowledge are called theory refinement systems. 

Theory refinement systems can be broadly classified into the following cate- 
gories. 



— Approaches based on Rule Induction which use decision tree or rule
learning algorithms for theory revision. Examples of such systems include
RTLS 0, EITHER |21], PTR [El, and TGCI 0.

— Approaches based on Inductive Logic Programming which represent
knowledge using first-order logic (or restricted subsets of it). Examples of
such systems include FOCL ^7| and FORTE |2fH.

— Connectionist Approaches using Artificial Neural Networks which
typically operate by first embedding domain knowledge into an appropriate
initial neural network topology and refine it by training the resulting neural
network on the set of labeled examples. The KBANN system as well
as related approaches 0 and [El offer examples of this approach.



In experiments involving datasets from the Human Genome Project¹, KBANN
has been reported to have outperformed symbolic theory refinement systems
(such as EITHER) and other learning algorithms such as backpropagation and
ID3 [23]. KBANN is limited by the fact that it does not modify the network’s
topology and theory refinement is conducted solely by updating the connection 
weights. This prevents the incorporation of new rules and also restricts the al- 
gorithm’s ability to compensate for inaccuracies in the domain theory. Against 
this background, constructive neural network learning algorithms, because of 
their ability to modify the network architecture by dynamically adding neurons 
in a controlled fashion j 1 4l2tif37] . offer an attractive connectionist approach to 
data-driven theory refinement. Available domain knowledge is incorporated into 
an initial network topology (e.g., using the rules-to-network algorithm of jsni 
or by other means). Inaccuracies in the domain theory are compensated for by 
extending the network topology using training examples. Figure E depicts this 
process. 

Constructive neural network learning algorithms [1 4f2til3 7] . that circumvent 
the need for a-priori specification of network architecture, can be used to con- 
struct networks whose size and complexity is commensurate with the complexity 
of the data, and trade off network complexity and training time against general- 
ization accuracy. A variety of constructive learning algorithms have been studied 
in the literature |4liSll 1 1 1 4l2(ii;-iYj . DistAI [23 is a polynomial time learning algo- 
rithm that is guaranteed to induce a network with zero classification error on any 
non-contradictory training set. It can handle pattern classification tasks in which 
patterns are represented using binary, nominal, as well as numeric attributes. Ex- 
periments on a wide range of datasets indicate that the classification accuracies 
attained by DistAI are competitive with those of other algorithms 27EHI- Since 



¹ These datasets are available at ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/datasets/.
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DistAI uses inter-pattern distance calculations, it can be easily extended to pat- 
tern classification problems wherein patterns are of variable sizes (e.g., strings 
or other complex symbolic structures) as long as suitable distance measures are 
defined m- Thus, DistAI is an attractive candidate for use in data-driven refine- 
ment of domain knowledge. Available domain knowledge is incorporated into an 
initial network topology. Inaccuracies in the domain theory are corrected by Dis- 
tAI which adds additional neurons to eliminate classification errors on training 
examples. 

Against this background, we present KBDistAI, a data-driven constructive 
theory refinement algorithm based on DistAI. 

2 Constructive Theory Refinement Using 
Knowledge-Based Neural Networks 

This section briefly describes several constructive theory refinement systems that 
have been studied in the literature. 

Fletcher and Obradovic 0 designed a constructive learning method for dy- 
namically adding neurons to the initial knowledge based network. Their approach 
starts with an initial network representing the domain theory and modifies this 
theory by constructing a single hidden layer of threshold logic units (TLUs) 
from the labeled training data using the HDE algorithm fp. The HDE algorithm 
divides the feature space with hyperplanes. Fletcher and Obradovic’s algorithm 
maps these hyperplanes to a set of TLUs and then trains the output neuron 
using the pocket algorithm 0. The KBDistAI algorithm proposed in this paper, 
like that of Fletcher and Obradovic, also constructs a single hidden layer. However,
it differs in one important aspect: it uses the computationally efficient DistAI
algorithm, which constructs the entire network in one pass through the training
set, instead of relying on the iterative approach used by Fletcher and Obradovic,
which requires a large number of passes through the training set.




Jihoon Yang et al.



The RAPTURE system is designed to refine domain theories that contain
probabilistic rules represented in the certainty-factor format j23- RAPTURE’S 
approach to modifying the network topology differs from that used in KBDistAI as 
follows: RAPTURE uses an iterative algorithm to train the weights and employs 
the information gain heuristic m to add links to the network. KBDistAI is 
simpler in that it uses a non-iterative constructive learning algorithm to augment 
the initial domain theory. 

Opitz and Shavlik have extensively studied connectionist theory refinement 
systems that overcome the fixed topology limitation of the KBANN algorithm 1221,
P3|- The TopGen algorithm [22| uses a heuristic search through the space of possi- 
ble expansions of a KBANN network constructed from the initial domain theory. 
TopGen maintains a queue of candidate networks ordered by their test accu- 
racy on a cross-validation set. At each step, TopGen picks the best network and 
explores possible ways of expanding it. New networks are generated by strategi- 
cally adding nodes at different locations within the best network selected. These 
networks are trained and inserted into the queue and the process is repeated. 
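
The queue-driven search described above can be sketched generically as a best-first search. This is only an illustration of the control flow, not TopGen itself: `expand` and `score` are hypothetical callables standing in for node addition and for training followed by evaluation on the validation set, and the toy example represents a network simply by its number of hidden nodes.

```python
import heapq
import itertools

def best_first_search(initial, expand, score, max_steps=10):
    """Best-first search over candidate networks.

    expand(net) -> iterable of candidate expansions of net
    score(net)  -> validation accuracy (higher is better)
    """
    tie = itertools.count()               # tie-breaker: networks are never compared
    queue = [(-score(initial), next(tie), initial)]
    best_net, best_score = initial, score(initial)
    for _step in range(max_steps):
        if not queue:
            break
        _neg, _order, net = heapq.heappop(queue)   # best candidate so far
        for child in expand(net):
            s = score(child)              # a real system would train the child first
            heapq.heappush(queue, (-s, next(tie), child))
            if s > best_score:
                best_net, best_score = child, s
    return best_net, best_score

# Toy stand-ins: accuracy peaks at four hidden nodes; expansion adds one node.
best_net, best_score = best_first_search(
    initial=1,
    expand=lambda n: [n + 1],
    score=lambda n: 1.0 - 0.1 * abs(n - 4),
)
print(best_net, best_score)
```

Every candidate in such a search must be trained before it can be scored, which is the cost that a single-pass constructive algorithm avoids.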

The REGENT algorithm uses a genetic search to explore the space of network 
architectures 122 ]. It first creates a diverse initial population of networks from the 
KBANN network constructed from the domain theory. Genetic search uses the 
classification accuracy on a cross-validation set as a fitness measure. REGENT’S 
mutation operator adds a node to the network using the TopGen algorithm. It 
also uses a specially designed crossover operator that maintains the network’s 
rule structure. The population of networks is subjected to fitness proportionate 
selection, mutation, and crossover for many generations and the best network 
produced during the entire run is reported as the solution. KBDistAI is consid- 
erably simpler than both TopGen and REGENT. It constructs a single network 
in one pass through the training data as opposed to training and evaluating 
a population of networks using the computationally expensive backpropagation 
algorithm for several generations. Thus, it is significantly faster than TopGen 
and REGENT. 

Parekh and Honavar propose a constructive approach to theory refinement
that uses a novel combination of the Tiling and Pyramid constructive
learning algorithms. They use a symbolic knowledge encoding procedure
to translate a domain theory into a set of propositional rules; the procedure
is based on the rules-to-networks algorithm of Towell and Shavlik, which
is used in KBANN, TopGen, and REGENT. It yields a set of rules each of which 
has only one antecedent. The rule set is then mapped to an AND-OR graph 
which in turn is directly translated into a neural network. The Tiling-Pyramid 
algorithm uses an iterative perceptron style weight update algorithm for setting 
the weights and the Tiling algorithm to construct the first hidden layer (which 
maps binary or numeric input patterns into a binary representation at the hidden 
layer) and the Pyramid algorithm to add additional neurons if needed. While 
Tiling-Pyramid is significantly faster than TopGen and REGENT, it is still slower 
than KBDistAI because of its reliance on iterative weight update procedures. 

3 KBDistAI: A Data-Driven Theory Refinement Algorithm

This section briefly describes our approach to knowledge-based theory refinement
using DistAI.



3.1 DistAI: An Inter-Pattern Distance Based Constructive Neural 
Network Algorithm 

DistAI is a simple and relatively fast constructive neural network learning
algorithm for pattern classification. The key idea behind DistAI is to add
hyperspherical hidden neurons one at a time based on a greedy strategy which 
ensures that each hidden neuron that is added correctly classifies a maximal 
subset of training patterns belonging to a single class. Correctly classified exam- 
ples can then be eliminated from further consideration. The process is repeated 
until the network correctly classifies the entire training set. When this happens, 
the training set becomes linearly separable in the transformed space defined by 
the hidden neurons. In fact, it is possible to set the weights on the hidden to 
output neuron connections without going through an iterative, time-consuming 
process. It is straightforward to show that DistAI is guaranteed to converge to 
100% classification accuracy on any finite training set in time that is polynomial 
(more precisely, quadratic) in the number of training patterns [37]. Previously
reported experiments show that DistAI, despite its simplicity, yields classifiers that
compare quite favorably with those generated using more sophisticated (and 
substantially more computationally demanding) learning algorithms. 
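
The greedy construction described above can be illustrated with a simplified sketch. It is not the published DistAI procedure: it assumes Euclidean distance, uses plain spheres around training patterns, assumes no two identical patterns carry different labels, and lets the earliest-added neuron that fires decide the class. The names `train_distai_like` and `predict` are hypothetical, and the sketch makes no attempt to match the complexity bound quoted above.

```python
import math

def train_distai_like(patterns, labels):
    """Add spherical neurons (center, radius, class) until all patterns are covered."""
    remaining = list(range(len(patterns)))
    neurons = []
    while remaining:
        best = None  # (covered indices, center, radius, class)
        for c in remaining:
            dists = [(math.dist(patterns[c], patterns[i]), i) for i in remaining]
            # The nearest remaining pattern of another class bounds the sphere.
            stop = min((d for d, i in dists if labels[i] != labels[c]),
                       default=math.inf)
            covered = [i for d, i in dists if d < stop]
            if best is None or len(covered) > len(best[0]):
                radius = max(d for d, i in dists if d < stop)
                best = (covered, patterns[c], radius, labels[c])
        covered, center, radius, cls = best
        neurons.append((center, radius, cls))
        done = set(covered)
        # Correctly classified patterns are removed from further consideration.
        remaining = [i for i in remaining if i not in done]
    return neurons

def predict(neurons, x, default=None):
    """The first neuron (in order of insertion) whose sphere contains x wins."""
    for center, radius, cls in neurons:
        if math.dist(center, x) <= radius:
            return cls
    return default

# Hypothetical two-cluster data set.
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = [0, 0, 0, 1, 1, 1]
net = train_distai_like(X, y)
```

Because every added neuron covers at least its own center, the loop always terminates with the whole training set classified correctly, and no iterative weight update is involved.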



3.2 Incorporation of Prior Knowledge into DistAI 

The current implementation of KBDistAI makes use of a very simple approach 
to the incorporation of prior knowledge into DistAI. First, the input patterns are 
classified using the rules. The resulting outputs (the rule-based classification
of the input pattern) are then appended to the pattern, which is fed to the
constructive neural network. In this way DistAI serves as the constructive neural
network in Figure 1 efficiently, without requiring a conversion of the rules into
a neural network.
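
The augmentation step can be sketched as follows. This is a minimal illustration under assumptions: rules are modeled as Python predicates over the raw pattern, their outputs are encoded as 0/1 features, and the function name and the example rule are hypothetical.

```python
def augment_with_rules(pattern, rules):
    """Append each rule's 0/1 output to the original input pattern."""
    return list(pattern) + [1.0 if rule(pattern) else 0.0 for rule in rules]

# Hypothetical rule over a pattern laid out as (income, debt_pmt).
debt_is_low = lambda p: p[1] < p[0] * 0.3

augmented = augment_with_rules([1000.0, 200.0], [debt_is_low])
```

The augmented patterns can then be passed unchanged to any constructive learner; the rule outputs simply become additional input dimensions.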

4 Experiments 

This section reports results of experiments using KBDistAI on data-driven theory
refinement for the financial advising problem used by Fletcher and Obradovic,
as well as the ribosome binding site and promoter site prediction problems used by
Shavlik's group:

— Ribosome 

This data is from the Human Genome Project. It comprises a domain
theory and a set of labeled examples. The input is a short segment of DNA
nucleotides, and the goal is to learn to predict whether the DNA segments
contain a ribosome binding site. There are 17 rules in the domain theory,
and 1880 examples in the dataset.

— Promoters 

This data is also from the Human Genome Project, and consists of a domain 
theory and a set of labeled examples. The input is a short segment of DNA 
nucleotides, and the goal is to learn to predict whether the DNA segments 
contain a promoter site. There are 31 rules in the domain theory, and 940 
examples in the dataset. 

— financial advisor 

The financial advisor rule base contains 9 rules, as shown in Figure 2. As
in the earlier study of this problem, a set of 5500 labeled examples that are
consistent with the rule base is randomly generated. 500 examples are used for
training and the remaining 5000 are used for testing.



1 if (sav_adeq and inc_adeq) then invest_stocks
2 if dep_sav_adeq then sav_adeq
3 if assets_hi then sav_adeq
4 if (dep_inc_adeq and earn_steady) then inc_adeq
5 if debt_lo then inc_adeq
6 if (sav > dep * 5000) then dep_sav_adeq
7 if (assets > income * 10) then assets_hi
8 if (income > 25000 + dep * 4000) then dep_inc_adeq
9 if (debt_pmt < income * 0.3) then debt_lo



Fig. 2. Financial advisor rule base. 
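
For illustration, the rule base of Fig. 2 can be written as a small Python function. Two assumptions are made that the figure does not state explicitly: rules with the same consequent are combined disjunctively, and a concept is false when none of its rules fires. The function name is hypothetical.

```python
def financial_advisor(sav, dep, assets, income, debt_pmt, earn_steady):
    """Evaluate the nine rules of Fig. 2 bottom-up and return invest_stocks."""
    dep_sav_adeq = sav > dep * 5000                          # rule 6
    assets_hi = assets > income * 10                         # rule 7
    dep_inc_adeq = income > 25000 + dep * 4000               # rule 8
    debt_lo = debt_pmt < income * 0.3                        # rule 9
    sav_adeq = dep_sav_adeq or assets_hi                     # rules 2 and 3
    inc_adeq = (dep_inc_adeq and earn_steady) or debt_lo     # rules 4 and 5
    return sav_adeq and inc_adeq                             # rule 1

advice = financial_advisor(sav=20000, dep=2, assets=0, income=40000,
                           debt_pmt=20000, earn_steady=True)
```

A function of this kind could be used to label randomly generated attribute vectors so that they are consistent with the rule base.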



4.1 Human Genome Project Datasets 

The reported results are based on a 10-fold cross-validation. The average
training and test accuracies of the rules in the domain theory alone were
87.29 ± 0.22 and 87.29 ± 2.03 for the Ribosome dataset and 77.45 ± 0.56 and
77.45 ± 5.01 for the Promoters dataset, respectively. Tables 1 and 2 show the
average generalization accuracy and the average network size (along with the
standard deviation* where available) for the Ribosome and Promoters datasets,
respectively.

Tables 1 and 2 compare the performance of KBDistAI with that of some of
the other approaches that have been reported in the literature. For the Ribosome
dataset, KBDistAI without pruning produced a lower generalization accuracy than
the other approaches and generated networks that were larger than those obtained
by Tiling-Pyramid. We believe that this might have been due to overfitting. In
fact, when the network pruning procedure was applied, the generalization accuracy
increased to 91.8 ± 1.8 with a smaller network size of 16.2 ± 3.7. In the case of
the Promoters dataset, KBDistAI produced comparable generalization accuracy with
smaller network size. As in Ribosome, network pruning boosted the generalization
accuracy to 95.5 ± 3.3 with a significantly smaller network size of 3.9 ± 2.3.

* The standard error can be computed instead, for better interpretation of the results.

Table 1. Results of Ribosome dataset.

                            Test %        Size
Rules alone                 87.3 ± 2.0    -
KBDistAI (no pruning)       86.3 ± 2.4    40.3 ± 1.3
KBDistAI (with pruning)     91.8 ± 1.8    16.2 ± 3.7
Tiling-Pyramid              90.3 ± 1.8    23 ± 0.0
TopGen                      90.9          42.1 ± 9.3
REGENT                      91.8          70.1 ± 25.1

Table 2. Results of Promoters dataset.

                            Test %        Size
Rules alone                 77.5 ± 5.0    -
KBDistAI (no pruning)       93.0 ± 2.8    12.2 ± 1.0
KBDistAI (with pruning)     95.5 ± 3.3    3.9 ± 2.3
Tiling-Pyramid              96.3 ± 1.8    34 ± 0.0
TopGen                      94.8          40.2 ± 3.3
REGENT                      95.8          74.9 ± 38.9
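
The footnote in this section notes that a standard error could be reported in place of the standard deviation. The relationship between the two can be sketched as below; the fold accuracies are made up for illustration, and the sample (n - 1) standard deviation is an assumption, since the text does not say which estimator was used.

```python
import math
import statistics

def mean_std_sem(values):
    """Return mean, sample standard deviation, and standard error of the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)          # sample (n - 1) estimator
    sem = std / math.sqrt(len(values))      # standard error of the mean
    return mean, std, sem

# Hypothetical per-fold test accuracies (not taken from the paper).
fold_accuracies = [90.0, 92.0, 94.0, 96.0]
mean, std, sem = mean_std_sem(fold_accuracies)
```

With 10 folds the standard error is smaller than the standard deviation by a factor of about 3.16, which is worth keeping in mind when comparing the ± figures of different systems.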

The time taken by our approach is significantly less than that of the other
approaches. KBDistAI takes from a fraction of a minute to a few minutes of CPU
time on each dataset used in the experiments. In contrast, TopGen and REGENT were
reported to have taken several days to obtain their published results.

4.2 Financial Advisor Rule Base 

As explained earlier, 5500 patterns were generated randomly to satisfy the rules
in Figure 2, of which 500 patterns were used for training and the remaining
5000 patterns were used for testing the network. In order to experiment with
several different incomplete domain theories, a selected rule was pruned together
with the rules for its antecedents in each experiment. For instance, if sav_adeq
was selected as the pruning point, then the rules for sav_adeq, dep_sav_adeq, and
assets_hi are eliminated from the rule base. In other words, rules 2, 3, 6, and 7
are pruned. Further, rule 1 is modified to read “if (inc_adeq) then invest_stocks”.
Then the initial network is constructed from this modified rule base and
augmented using constructive learning.
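
The pruning procedure can be sketched on a symbolic encoding of Fig. 2. The dictionary layout and the `prune` helper are hypothetical. The sketch removes every rule that concludes the pruning point or one of its supporting concepts, and drops the pruned concepts from the antecedents of the surviving rules. The text does not say how a rule left with no antecedents should be treated, so such a rule is simply kept with an empty antecedent list.

```python
RULES = {
    1: (["sav_adeq", "inc_adeq"], "invest_stocks"),
    2: (["dep_sav_adeq"], "sav_adeq"),
    3: (["assets_hi"], "sav_adeq"),
    4: (["dep_inc_adeq", "earn_steady"], "inc_adeq"),
    5: (["debt_lo"], "inc_adeq"),
    6: (["sav > dep * 5000"], "dep_sav_adeq"),
    7: (["assets > income * 10"], "assets_hi"),
    8: (["income > 25000 + dep * 4000"], "dep_inc_adeq"),
    9: (["debt_pmt < income * 0.3"], "debt_lo"),
}

def prune(rules, point):
    """Remove the rules supporting `point` and strip it from other rules."""
    doomed = {point}
    changed = True
    while changed:  # collect everything the pruning point depends on
        changed = False
        for antecedents, head in rules.values():
            if head in doomed:
                for a in antecedents:
                    if a not in doomed:
                        doomed.add(a)
                        changed = True
    pruned = {}
    for number, (antecedents, head) in rules.items():
        if head in doomed:
            continue  # this rule concludes a pruned concept
        pruned[number] = ([a for a in antecedents if a not in doomed], head)
    return pruned

incomplete = prune(RULES, "sav_adeq")
```

For the pruning point sav_adeq this reproduces the example in the text: rules 2, 3, 6, and 7 disappear and rule 1 keeps only inc_adeq as its antecedent.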

Our experiments follow those performed in the earlier studies of this problem.
As we can see in Tables 3 and 4, KBDistAI either outperformed the other
approaches or gave comparable results. It resulted in higher classification
accuracy than other approaches in several cases, and it always produced fairly
compact networks while using substantially fewer computational resources. Again,
as in the Human Genome Project datasets, network pruning maintained or improved
the generalization accuracy in all cases, with equal or smaller network size. For
the pruning points listed in Tables 3 and 4 (the sequence from dep_sav_adeq to
inc_adeq), the generalization accuracy with pruning was 89.2, 99.5, 98.4, 92.9,
94.9 and 93.0 with network sizes of 17, 2, 5, 9, 5 and 12, respectively.



Table 3. Results of financial advisor rule base (HDE).

                  HDE                       Rules alone
Pruning point     Test %    Hidden Units    Test %
dep_sav_adeq      92.7      31              75.1
assets_hi         92.4      23              93.4
dep_inc_adeq      85.8      25              84.5
debt_lo           84.7      30              61.7
sav_adeq          92.2      19              90.9
inc_adeq          81.2      32              64.6


Table 4. Results of financial advisor rule base (KBDistAI and Tiling-Pyramid).

                  KBDistAI          Tiling-Pyramid              Rules alone
Pruning point     Test %    Size    Test %        Size          Test %
dep_sav_adeq      88.5      21      91.2 ± 1.7    28.2 ± 3.6    52.4
assets_hi         99.5      2       99.4 ± 0.2    10 ± 0.0      99.5
dep_inc_adeq      98.0      8       94.3 ± 1.5    21.0 ± 3.1    90.4
debt_lo           91.6      16      94.1 ± 2.0    22.1 ± 4.0    81.2
sav_adeq          93.8      10      90.8 ± 1.5    26.4 ± 3.3    87.6
inc_adeq          91.2      18      83.8 ± 2.2    32.7 ± 2.9    67.4


5 Summary and Discussion 

Theory refinement techniques offer an attractive approach to exploiting
available domain knowledge to enhance the performance of data-driven knowledge
acquisition systems. Neural networks have been used extensively in theory
refinement systems that have been proposed in the literature. Most such systems
translate the domain theory into an initial neural network architecture and then
train the network to refine the theory. The KBANN algorithm has been demonstrated
to outperform several other learning algorithms on some domains. However,
a significant disadvantage of KBANN is its fixed network topology. The TopGen and
REGENT algorithms, on the other hand, allow modifications to the network
architecture. Experimental results have demonstrated that TopGen and REGENT
outperform KBANN on several applications [22, 23]. The Tiling-Pyramid algorithm,
proposed by Parekh and Honavar for constructive theory refinement, builds a
network of perceptrons. Its performance, in terms of classification accuracies
attained, is reported to be comparable to that of REGENT and TopGen, but at
significantly lower computational cost.

The implementation of KBDistAI used in the experiments reported in this 
paper uses the rules directly (by augmenting the input patterns with the out- 
puts obtained from the rules) as opposed to the more common approach of 
incorporating the rules into an initial network topology. The use of DistAI for 
network construction makes KBDistAI significantly faster than approaches that 
rely on iterative weight update procedures (e.g., perceptron learning, backpropa- 
gation algorithm) and/or computationally expensive genetic search. Experimen- 
tal results demonstrate that KBDistAI’s performance in terms of generalization 
accuracy is competitive with that of several of the more computationally expen- 
sive algorithms for data-driven theory refinement. Additional experiments using 
real-world data and domain knowledge are needed to explore the capabilities 
and limitations of KBDistAI and related algorithms for theory refinement. We 
conclude with a brief discussion of some promising directions for further research. 

It can be argued that KBDistAI is not a theory refinement system in a strict 
sense. It makes use of the domain knowledge in its inductive learning proce- 
dure rather than refining the knowledge. Perhaps KBDistAI is more accurately 
described as a knowledge guided inductive theory construction system. 

There are several extensions and variants of KBDistAI that are worth ex- 
ploring. Given the fact that DistAI relies on inter-pattern distances to induce 
classifiers from data, it is straightforward to extend it so as to handle a much 
broader class of problems including those that involve patterns of variable sizes 
(e.g., strings) or symbolic structures as long as suitable inter-pattern distance 
metrics can be defined. Some steps toward rigorous definitions of distance
metrics based on information theory have been outlined in the literature. Variants of DistAI and
KBDistAI that utilize such distance metrics are currently under investigation. 

Several authors have investigated approaches to rule extraction from neural 
networks in general, and connectionist theory refinement systems in particular.
One goal of such work is to represent the learned knowledge in a form
that is comprehensible to humans. In this context, rule extraction from classifiers 
induced by KBDistAI is of some interest. 

In several practical applications of interest, not all of the data needed for
synthesizing reasonably precise classifiers is available at once. This calls for
incremental algorithms that continually refine knowledge as more and more data
becomes available. Computational efficiency considerations argue for the use of
data-driven theory refinement systems as opposed to storing large volumes of
data and rebuilding the entire classifier from scratch as new data becomes
available. Some preliminary steps in this direction have been reported.

A somewhat related problem is that of knowledge discovery from large, phys- 
ically distributed, dynamic data sources in a networked environment (e.g., data 
in genome databases). Given the large volumes of data involved, this argues for 
the use of data-driven theory refinement algorithms embedded in mobile software
agents that travel from one data source to another, carrying with them only the
current knowledge base, as opposed to approaches that rely on shipping
large volumes of data to a centralized repository where knowledge acquisition
is performed. Thus, data-driven knowledge refinement algorithms constitute one
of the key components of distributed knowledge network environments for
knowledge discovery in many practical applications (e.g., bioinformatics).

In several application domains, knowledge acquired on one task can often be 
utilized to accelerate knowledge acquisition on related tasks. Data-driven theory 
refinement is particularly attractive in applications that lend themselves to such 
cumulative multi-task learning. The use of KBDistAI or similar algorithms
in such scenarios remains to be explored. 

Abstract. The goal of input-output modeling is to apply a test input to
a system, analyze the results, and learn something useful from the cause- 
effect pair. Any automated modeling tool that takes this approach must 
be able to reason effectively about sensors and actuators and their in- 
teractions with the target system. Distilling qualitative information from 
sensor data is fairly easy, but a variety of difficult control-theoretic issues 
— controllability, reachability, and utility — arise during the planning 
and execution of experiments. This paper describes some representations 
and reasoning tactics, collectively termed qualitative bifurcation analy- 
sis, that make it possible to automate this task. 



1 Input-Output Modeling 

System identification (SID) is the process of inferring an internal ordinary
differential equation (ODE) model from external observations of a system. The
computer program PRET automates the SID process, using a combination
of artificial intelligence and system identification techniques to construct ODE
models of lumped-parameter continuous-time nonlinear dynamic systems. As
diagrammed in Fig. 1, PRET uses domain knowledge to combine model fragments
into ODEs, then employs actuators and sensors to learn more about the target
system, and finally tests the ODEs against the actuator/sensor data using a
body of mathematical knowledge encoded in first-order logic.

Fig. 1. PRET uses sensors and actuators to interact with target systems in an
input-output approach to dynamical system modeling.

This input-output (I/O) approach to dynamical system modeling, which dis- 
tinguishes PRET from other AI modeling tools, is very powerful and also ex- 
tremely difficult. Distilling available sensor information into qualitative form is 
reasonably straightforward, as described in our IDA-97 paper, but reasoning
about the information so derived is subtle and challenging. Dealing with actu- 
ators is even harder because of the nonlinear control theory that is involved. 
Among other things, determining what experiments one can perform from the 
system’s present state involves complicated reasoning about controllability and 
reachability. In an automated framework, it is also important to reason about 
what can be learned from a given experiment. During the input-output modeling 
process, PRET must solve all three of these problems. That is, given a black-box
system, a partial measurement of its current state, some knowledge about the
available actuators, and some preliminary ideas about a candidate model, PRET
must be able to decide what experiments are possible and useful. This is a dif- 
ficult, open problem for nonlinear systems, even for human experts. The topic 
of this paper is a set of knowledge representation and reasoning techniques that 
make it possible to automate this task. 

In linear systems these problems are relatively easy. Engineering approaches
to linear input-output analysis are well developed; standard techniques for
exciting different useful states of the system include changing the type (e.g.,
ramp, step) or parameters (e.g., amplitude, frequency) of the input. The impulse
response of a system, that is, its transient response to a quick kick
$x(t_0) = 1;\ x(t) = 0\ \forall t \neq t_0$, is particularly useful. The natural
resonant and anti-resonant frequencies appear as spikes and the mode shapes
between those spikes can show whether a vibrating mechanical system is mass- or
stiffness-dominated.

Nonlinear systems pose a far more imposing challenge to input-output model- 
ing; their mathematics is vastly harder, and many of the analysis tools described 
in the previous paragraph do not apply. Almost all forms of transient analysis 
(e.g., step or impulse response) are useless in nonlinear problems, as is frequency 
response; the concept of a discrete set of spectral components simply does not 
make sense. Because of this, nonlinear dynamicists typically allow transients to 
die out and then reason about attractors in the phase or state space, and how the 
geometry and topology of those attractors change when the system parameters 
are varied. 

Our approach targets the problems that arise in reasoning about the multiple
sets of observations that arise in phase-portrait analysis of complex systems. In
particular, we use a combined state/parameter space and decompose it into dis- 
crete regions, each associated with an equivalence class of dynamical behaviors, 
derived qualitatively using geometric reasoning. These discrete regions describe 
the behavior of the system in a uniquely powerful way. As each trajectory is 
effectively equivalent, in a well-known sense, to all the other trajectories in the 
same region, one can describe the behavior in that region in a much simpler way, 
which results in ease of analysis — and great computational savings. 

The representation described in this paper, an abstraction/extension of
the traditional nonlinear analysis technique termed bifurcation analysis, allows
PRET's intelligent sensor analysis and actuator control modules to reason
effectively about multiple sets of observations over a given system. Coupled with
a knowledge representation and reasoning framework that adapts smoothly to
how much one knows about the system (e.g., using linear analysis when
appropriate), which is described in another paper, this representation allows
PRET to reason effectively about input-output modeling of nonlinear dynamical
systems.

To set the context, the following section gives a brief overview of PRET. We
then focus in on the input-output modeling phase, describe our representation
and reasoning framework, and show how PRET exploits that framework.



2 PRET 

As outlined in the previous section, PRET is an automated tool for nonlinear
system identification (SID). Its inputs are a set of observations of the outputs of
a black-box system, and its output is an ordinary differential equation (ODE)
model of the internal dynamics of that system. PRET's architecture wraps a
layer of artificial intelligence (AI) techniques around a set of traditional formal
engineering methods like impulse-response analysis, nonlinear regression, etc.
The AI layer combines several forms of reasoning* via a special first-order logic
inference system to intelligently assess the task at hand; it then reasons
from that information to automatically choose, invoke, and interpret the results
of appropriate lower-level techniques. This framework lets PRET shift fluidly
back and forth between domain-specific reasoning and general mathematics to
navigate efficiently through an exponential search space of possible models. This
approach has met with success in a variety of simulated and real problems,
ranging from textbook systems to real-world engineering applications.

PRET takes a “generate-and-test” approach to model building. It uses
domain-specific knowledge to assemble combinations of user-specified and
automatically generated ODE fragments into a candidate model†; it tests that
model by performing a series of factual inferences about the ODE and the
observations and then using a theorem prover to search for contradictions in
those sets of facts. The technical challenge here is efficiency: the search space
is huge, and so PRET must identify contradictions as quickly, simply, and cheaply
as possible. The key to doing so is to classify model and system behavior at an
appropriate qualitative level and to exploit all available domain-specific
knowledge in the most useful way. Symbolic algebra can be used to remove huge
branches from the search space. If the target system is known to be chaotic, for
instance, all linear ODEs can be immediately discarded, and the computation
involved (calculating the Jacobian and ascertaining that all of its entries are
constant) requires only simple, inexpensive symbolic reasoning. In other
situations, pruning a single leaf off the tree of possible models can be extremely
expensive (e.g., estimating parameter values for a nonlinear ODE prior to a final
corroborative simulation/comparison run, which is a complicated global
optimization problem). Some analysis methods, such as phase-portrait analysis,
apply to all ODEs, whereas others are only meaningful in specific domains (e.g.,
creep tests in viscoelastic systems). Orchestrating this complex reasoning process
is a very difficult problem; its solution requires carefully crafted knowledge
representation frameworks that allow for an elegant formalization of the essential
building blocks of an engineer's knowledge and reasoning, and powerful automated
machinery that uses the formalized knowledge to reason flexibly about
a variety of modeling problems.

* Qualitative reasoning, qualitative simulation, numerical simulation, geometric
reasoning, constraint propagation, resolution, reasoning with abstraction levels,
declarative meta-control, and a simple form of truth maintenance.

† In mechanics, for instance, PRET uses Newton's laws to combine force terms; in
electronics, it uses Kirchhoff's laws to sum voltages in a loop or currents in a cutset.

The input-output modeling strategies that are the topic of this paper play
important roles in both the generate and the test phase. The “input” half of
PRET's intelligent sensor/actuator analysis and control module, which is
reviewed briefly in the following sections and covered in detail elsewhere, uses
geometric reasoning and delay-coordinate embedding to distill abstract, useful
qualitative information from a highly specific numeric sensor data set. The
“output” part, described in the following sections, reasons about multiple sets of
observations about a given system using a new knowledge representation called the
qualitative state/parameter space and an associated reasoning strategy termed
qualitative bifurcation analysis, both of which are AI-adapted versions of
well-known nonlinear dynamics techniques. For more details on the rest of PRET
(issues, solutions, internal representations, encoded knowledge bases, examples
solved, etc.), please consult the papers cited in the previous two paragraphs.

3 Qualitative Bifurcation Analysis 

One of the goals of the qualitative reasoning (QR) community is to abstract
specific instances of behavior into more-general descriptions of a system. An 
80kg adult bouncing on the end of a bungee cord, for instance, will produce a 
different time series from a 50kg child, but both produce similar damped oscil- 
latory responses. Reasoning about these two behaviors in their time-series form 
can be difficult, as it requires detailed examination of the amplitude decay rate 
of and the phase shift between two decaying sinusoids. The state-space repre- 
sentation, which suppresses the time variable and plots position versus velocity, 
brings out the similarity between these two behaviors in a very clear way. Both 
bungee jumps, for example, manifest on a state-space plot as similar decaying 
spirals. Automated phase-portrait analysis techniques, which combine
ideas from dynamical systems, discrete mathematics, and artificial intelligence, 
generate qualitative descriptions that capture this information. 

A discretized version of the state-space representation can abstract away
many low-level details about the dynamics of a system while preserving its
important qualitative properties. The cell-to-cell-mapping formalism [14], for
instance, discretizes a set of n-dimensional state vectors onto an n-dimensional
mesh of uniform boxes or cells. The circular state-space trajectory in Fig. 2(a),
for example, which is a sequence of two-vectors of floating-point numbers, can be
represented by the cell sequence [... (0,0) (1,0) (2,0) (3,0) (4,0) (4,1) (4,2)
(4,3) ...]. Because multiple trajectory points are mapped into each cell, this
discretized representation of the dynamics is significantly more compact than the
original series of floating-point numbers and therefore much easier to work with.
Using this representation, the dynamics of a trajectory can be quickly and
qualitatively classified using simple geometric heuristics, in this case as a
limit cycle. PRET's intelligent sensor analysis procedures use this type of
discretized geometric reasoning to “distill” out the qualitative features of a
given state-space portrait, allowing PRET to reason about these features at a much
higher (and cheaper) abstraction level. This scheme is covered in detail in [3].

Fig. 2. Identifying a limit cycle using the cell-dynamics method.
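
The discretization step and a very crude periodicity heuristic can be sketched as follows. This is an illustration only, not the cell-dynamics method used in PRET: the function names, the cell size, and the test "the cell sequence returns to its first cell and then repeats the same path" are all assumptions chosen for brevity, and the heuristic ignores transients.

```python
import math

def to_cells(trajectory, cell_size):
    """Map state vectors to integer cell coordinates, merging consecutive repeats."""
    cells = []
    for point in trajectory:
        cell = tuple(math.floor(x / cell_size) for x in point)
        if not cells or cells[-1] != cell:
            cells.append(cell)
    return cells

def looks_periodic(cells):
    """True if the sequence returns to its first cell and then repeats the same path."""
    returns = [i for i in range(1, len(cells)) if cells[i] == cells[0]]
    if not returns:
        return False
    period = returns[0]
    return all(cells[i] == cells[i % period] for i in range(len(cells)))

N = 64  # samples per revolution
# Circle of radius 2, sampled at the same N angles on each of three revolutions.
circle = [(2 * math.cos(2 * math.pi * (k % N) / N),
           2 * math.sin(2 * math.pi * (k % N) / N)) for k in range(3 * N)]
# Decaying spiral that starts at the same point but never comes back to it.
spiral = [(2 * 0.99 ** k * math.cos(2 * math.pi * k / N),
           2 * 0.99 ** k * math.sin(2 * math.pi * k / N)) for k in range(3 * N)]

circle_cells = to_cells(circle, 1.0)
spiral_cells = to_cells(spiral, 1.0)
```

The cell sequence is much shorter than the list of floating-point samples, and the classification works on integer tuples only.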

This is only, however, a very small part of the power of the qualitative
phase-portrait representation. Dynamical systems can be extremely complicated;
attempting to understand one by analyzing a single behavior instance (e.g., system
evolution from one initial condition at one parameter value, like Fig. 2(a))
is generally inadequate. Rather, one must vary a system's inputs and control
parameters and study the change in the response. Even in one-parameter systems,
however, this procedure can be difficult; as the parameter is varied, the
behavior may vary smoothly in some ranges and then change abruptly (“bifurcate”)
at critical parameter values. A thorough representation of this behavior,
then, requires a “stack” of state-space portraits: at least one for each
interesting and distinct range of values. Constructing such a stack requires
automatic recognition of the boundaries between these ranges, and the cell
dynamics representation makes this very easy. Fig. 2(b), for example, shows
another limit cycle trajectory, one with different geometry but identical
topology. The key concept here is that a set of geometrically different and yet
qualitatively similar trajectories (an “equivalence class” with respect to some
important dynamical property) can be classified as a single coherent group of
state-space portraits. This is the basis of the power of the techniques described
in this paper.

Consider, for example, a driven pendulum model described by an ODE
with mass (m), arm length (l), gravity constant (g), damping factor (β), drive
amplitude (γ) and drive frequency (α). m, l, g and β are constants; the state
variables of this system are θ and ω = dθ/dt. In many experiments, the drive
amplitude and/or frequency are controllable: these are the “control parameters” of
the system. The behavior of this apparently simple device is really quite
complicated and interesting. For low drive frequencies, it has a single stable
fixed point; as the frequency is raised, the attractor undergoes a series of
bifurcations between chaotic and periodic behavior. These bifurcations do not,
however, necessarily cause the attractor to move. That is, the qualitative
behavior of the system changes and the operating regime (in state space) does not.
Traditional analysis of this system would involve constructing state-space
portraits, like the ones shown in Fig. 2, at closely spaced control parameter
values across some interesting range; this is the bifurcation analysis procedure
introduced in the previous paragraph. Traditional AI/hybrid representations do not
handle this smoothly, as the operating regimes involved are not distinct. If,
however, one adds a parameter axis to the state space, most of these problems
vanish. Fig. 3 describes the behavior of the driven pendulum in this new
state/parameter-space (S/P-space) representation. Each (θ, ω) slice of this plot
is a state-space portrait, and the control parameter varies along the Drive
Frequency axis.

Fig. 3. A state/parameter (S/P) space portrait of the driven pendulum: a
parameterized collection of state-space portraits of the device at various Drive
Frequencies.
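
A state/parameter-space portrait of this kind can be approximated numerically by simulating the pendulum at several drive frequencies and stacking the resulting (θ, ω) trajectories. The sketch below rests on assumptions: it uses one common textbook form of the driven pendulum equation, θ'' + (β/m) θ' + (g/l) sin θ = (γ/(m l)) sin(α t), which may differ from the equation intended in the text, and all parameter values, step sizes, and function names are hypothetical.

```python
import math

def pendulum_rhs(t, theta, omega, m, l, g, beta, gamma, alpha):
    """Assumed driven-pendulum dynamics, written as two first-order ODEs."""
    dtheta = omega
    domega = ((gamma / (m * l)) * math.sin(alpha * t)
              - (beta / m) * omega - (g / l) * math.sin(theta))
    return dtheta, domega

def rk4_step(t, theta, omega, dt, params):
    """One classical fourth-order Runge-Kutta step."""
    k1 = pendulum_rhs(t, theta, omega, *params)
    k2 = pendulum_rhs(t + dt / 2, theta + dt / 2 * k1[0], omega + dt / 2 * k1[1], *params)
    k3 = pendulum_rhs(t + dt / 2, theta + dt / 2 * k2[0], omega + dt / 2 * k2[1], *params)
    k4 = pendulum_rhs(t + dt, theta + dt * k3[0], omega + dt * k3[1], *params)
    theta += dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
    omega += dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return theta, omega

def simulate(alpha, gamma, theta0=0.5, omega0=0.0, m=1.0, l=1.0, g=9.8,
             beta=0.5, dt=0.01, steps=6000):
    """Return the list of (theta, omega) states visited from the initial condition."""
    params = (m, l, g, beta, gamma, alpha)
    t, theta, omega = 0.0, theta0, omega0
    path = []
    for _ in range(steps):
        theta, omega = rk4_step(t, theta, omega, dt, params)
        t += dt
        path.append((theta, omega))
    return path

def sp_portrait(alphas, gamma, transient=3000):
    """One state-space slice per drive frequency, with the transient discarded."""
    return {alpha: simulate(alpha, gamma)[transient:] for alpha in alphas}

portrait = sp_portrait([0.5, 1.0, 1.5], gamma=1.0)
undriven_end = simulate(alpha=1.0, gamma=0.0)[-1]
```

Each dictionary entry plays the role of one slice of the portrait; discarding the first part of every run follows the practice, mentioned earlier, of letting transients die out before reasoning about attractors.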

Our final step is to combine this state/parameter-space idea with the
qualitative abstraction of cell dynamics, producing the qualitative
state/parameter space (QS/P-space) representation that is the basis of the KRR
framework that is the topic of this paper. A QS/P-space portrait of the driven
pendulum is shown in Fig. 4. This representation is similar to the S/P-space
portrait shown in Fig. 3, but it groups similar behaviors into equivalence
classes, and then uses those groupings to define the boundaries of qualitatively
distinct regions.

Fig. 4. A qualitative state/parameter-space (QS/P-space) portrait of the driven
pendulum. This is an abstraction of the state/parameter space portrait shown in
Fig. 3; it groups qualitatively similar behaviors into equivalence classes to
define the boundaries of qualitatively distinct regions of state/parameter space.

This qualitative state/parameter-space representation is an extremely pow- 
erful modeling tool. One can use it to identify the individual operating regimes, 
then create a separate model in each, and perhaps use a finite-state machine to 
model transitions between them. More importantly, however, the QS/P-space
representation lets the model builder leverage the knowledge that its regions
(e.g., the five slabs in Fig. 4) all describe the behavior of the same system,
at different parameter values. This is exactly the type of knowledge that one
needs in order to plan how to learn more about a system by changing its inputs
and observing the results. The remainder of this paper expands upon these ideas,
describing how the QS/P-space representation helps PRET perform input-output
modeling of dynamical systems.

4 Input-Output Modeling in PRET 

The goal of input-output modeling is to apply a test input to a system, analyze 
the results, and learn something useful from the cause/effect pair. In this section, 
we describe how PRET reasons about this process using the QS/P representation
introduced in the previous section. 

As described earlier, PRET takes a generate-and-test approach, using a
small, powerful domain theory to build ODE models and a larger, more-general
ODE theory to test those models against the known behavior of the system. I/O
modeling using the QS/P representation contributes to this process in a variety
of ways. First, it allows PRET to reason effectively about test inputs; a good
test input excites the behavior in a useful but not overwhelming way, and
choosing such an input is nontrivial. The representation described in the
previous section and pictured in Fig. 4 also allows PRET to reason about
sensible hypothesis combinations, a process without which the generate phase
would be reduced to blind enumeration of an exponential number of candidate
models. Finally, qualitative I/O modeling techniques help PRET reason about
state variables and observations, information whose sole source would otherwise
be the user.

The “input” part of PRET’s input-output reasoning takes place in the
intelligent sensor data analyzer. This module first reconstructs any hidden
dynamics from the sensor data and then analyzes the results using geometric
reasoning. The first of these two steps is necessary because fully observable
systems, in which all of a system’s state variables can be measured, are rare in
normal engineering practice. Often, some of the state variables are either
physically inaccessible or cannot be measured with available sensors. This is
control theory’s observer problem: the task of inferring the internal state of a
system solely from observations of its outputs. Delay-coordinate embedding,
PRET’s solution to this problem, creates an m-dimensional reconstruction-space
vector from m time-delayed samples of data from a single sensor. The central
idea is that the reconstruction-space dynamics and the true (unobserved)
state-space dynamics are topologically identical. This provides a partial
solution to the observer problem, as a state-space portrait reconstructed from a
single sensor is qualitatively identical to the true multidimensional dynamics
of the system. Given a reconstructed state-space portrait of the system’s
dynamics, the intelligent sensor data analyzer’s second phase distills out its
qualitative properties using the cell dynamics paradigm discussed earlier. The
results of reconstructing and analyzing the sensor data are a set of qualitative
observations similar to those a human engineer would make about the system, such
as “the system is oscillating.” This information is useful: it not only raises
the abstraction level of PRET’s reasoning about models but is also critical to
the mechanics of the qualitative bifurcation analysis process, as described
later in this section.
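
The delay-coordinate construction described above can be illustrated with a
minimal sketch. It is not PRET's implementation: the function name
`delay_embed`, the delay parameter `tau`, and the toy data are assumptions made
for illustration, and the text does not say how the delay or the dimension m is
chosen.

```python
def delay_embed(samples, m, tau):
    """Build m-dimensional delay vectors from one sensor's scalar time series.

    Vector i is (x[i], x[i + tau], ..., x[i + (m - 1) * tau]), i.e. m samples
    spaced tau steps apart. `tau` (the delay, in samples) is an assumed
    parameter; the source text does not specify how it is selected.
    """
    if m < 1 or tau < 1:
        raise ValueError("m and tau must be positive integers")
    span = (m - 1) * tau  # index distance between first and last coordinate
    return [
        tuple(samples[i + k * tau] for k in range(m))
        for i in range(len(samples) - span)
    ]


series = [0, 1, 2, 3, 4, 5]  # stand-in for sampled data from a single sensor
vectors = delay_embed(series, m=3, tau=2)
print(vectors)
```

Each tuple is one point in the reconstruction space; plotting the sequence of
points gives the reconstructed state-space portrait that the second analysis
phase would then classify.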

Reasoning about actuators is much more difficult, so the development of
PRET’s intelligent actuator controller has been slow. The problem lies in the
inherent difference between passive and active modeling. It is easy to recognize
damped oscillations in sensor data without knowing anything about the system
or the sensor, but using an actuator requires a lot of knowledge about both.
Different actuators affect different system properties (e.g., the half dozen
knobs on the front of a stereo receiver). They also have very different
characteristics (range, resolution, response time, etc.); consider the different
dynamics of cooking on campfires, gas/electric stoves, or blast furnaces.
Identical actuators can affect systems in radically different ways; a gear shift
lever in a car, for instance, invokes very different responses depending on
whether it is moved into “first” or “reverse.” When the sensor and the system
are linear, there are some useful standard procedures for choosing test inputs,
codifying the results, and reasoning about their implications (e.g., step and
impulse response), but these kinds of drive signals elicit tremendously
complicated responses from nonlinear systems, making output analysis very
difficult. In nonlinear systems analysis, one typically applies constant inputs,
ignores any transients, and reasons about the resulting attractors in the
state-space representation, as described in Sect. 3. Deciding how to use an
actuator is only the first part of the problem. Any planning about experiments
must also consider the set of possible states of the system: those that are
reachable from the existing state with the available control input. Finally,
effective input-output modeling requires reasoning about useful experiments:
those that increase one’s knowledge about the target system in a productive way.
The ultimate goal of PRET’s intelligent sensor/actuator control module is to
find and exploit the overlap between these sets of useful and possible
experiments.

Footnote: The topological equivalence of the reconstruction-space and true
state-space dynamics also allows PRET to estimate an upper bound on the number
of state variables in a system.

To solve these difficult problems (controllability, reachability, and utility),
PRET must reason about multiple sets of observations of a system, each made
under a different actuator condition. It must also plan those actuator
conditions, which involves modeling not only the actuator itself but also the
behavior of the actuator-system interface. Our current solution assumes that
PRET knows the actuator input range, a reasonable assumption because the
actuator normally exists as an external device, unlike the internal workings of
an unknown physical system. Using the QS/P paradigm developed earlier, coupled
with the cell dynamics technique and a simple binary search strategy, PRET first
performs a qualitative bifurcation analysis. It begins at the lower end of the
actuator range, setting the drive signal to a constant value, letting the
transient die out, and then using cell dynamics to classify the behavior. It
then increments the actuator input and repeats the process. When the attractor
bifurcates, PRET zeroes in on the bifurcation point by successively bisecting
the actuator input interval. The result of this procedure is a QS/P-space
portrait of the system, complete with regime boundaries and behavioral
descriptions in each regime, such as:

“in the temperature range from 0 to 50°C, the system undergoes a
damped oscillation to a fixed point at (x, y) = (1.4, -8); when T > 50°C,
it follows a period-two limit cycle located at...”

PRET then invokes the model-building process in each regime, and finally
attempts to unify these models into a single ODE.
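
The sweep-and-bisect procedure described above can be sketched as follows. This
is an illustrative outline only, not PRET's code: `classify` is a hypothetical
stand-in for "hold the actuator at a constant value, let the transient die out,
and label the attractor with cell dynamics," and the step size and tolerance
are assumed parameters. The sketch also assumes at most one bifurcation between
consecutive sweep points.

```python
def qualitative_bifurcation_analysis(classify, lo, hi, step, tol):
    """Sweep the actuator range [lo, hi] and bisect each behavior change.

    classify(u) must return a qualitative label for the behavior observed at
    constant actuator input u (hypothetical stand-in for cell dynamics).
    Returns (boundaries, labels): estimated regime boundaries and the label of
    each regime, in order of increasing actuator input.
    """
    boundaries = []
    u_prev, label_prev = lo, classify(lo)
    labels = [label_prev]
    u = lo
    while u < hi:
        u = min(u + step, hi)          # increment the actuator input
        label = classify(u)
        if label != label_prev:        # the attractor has bifurcated
            a, b = u_prev, u           # bracket containing the boundary
            while b - a > tol:         # successive bisection of the bracket
                mid = (a + b) / 2.0
                if classify(mid) == label_prev:
                    a = mid
                else:
                    b = mid
            boundaries.append((a + b) / 2.0)
            labels.append(label)
        u_prev, label_prev = u, label
    return boundaries, labels


def toy_classify(temperature):
    # Toy system mirroring the quoted example: one regime change at 50 degrees.
    return "fixed point" if temperature <= 50.0 else "period-two limit cycle"


boundaries, labels = qualitative_bifurcation_analysis(
    toy_classify, lo=0.0, hi=100.0, step=10.0, tol=1e-3
)
print(boundaries, labels)
```

The returned boundaries and labels correspond to the regime boundaries and
per-regime behavioral descriptions of a QS/P-space portrait; a real run would
replace `toy_classify` with an experiment on the physical system.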

In the driven pendulum example, this procedure works as follows. Qualitative
bifurcation analysis identifies five separate qualitative state/parameter-space
regions, as shown in Fig. 4. PRET then builds an ODE model for each regime using
the procedures described earlier. These ODEs are shown in Table 1. Note that
four of these five ODEs are different, but all five are, in reality, instances
of a single ODE that accounts for the physical behavior across the whole
parameter range. PRET’s goal is to find that globally valid model, so it must
unify these ODEs. Unification is reasonably straightforward if it is correctly
interleaved with the model-building process. In the driven pendulum, for
example, PRET analyzes the system in the small-angle regime (where
$\sin\theta \approx \theta$ and the system acts like a simple harmonic
oscillator), producing the model $\ddot{\theta}(t) = -\frac{g}{l}\theta(t)$.
When the actuator moves the system to the neighboring limit cycle regime, where
larger-angle behavior dominates, the small-angle solution no longer holds,
forcing a new model search, which yields the model
$\ddot{\theta}(t) = -\frac{g}{l}\sin\theta(t)$. PRET then tries to reconcile the
two models, applying both of them in both regimes. Since
$\ddot{\theta}(t) = -\frac{g}{l}\theta(t)$ is a special case of
$\ddot{\theta}(t) = -\frac{g}{l}\sin\theta(t)$, the former holds in only one of
the two regimes, whereas the latter holds in both, so PRET discards the
$\ddot{\theta}(t) = -\frac{g}{l}\theta(t)$ model and goes on to the next regime,
repeating the model building/unification process. Once PRET finds a single model
that accounts for all observed behavior in all regimes across the range of
interest, its task is complete. Such a model may not, of course, exist; a system
may be governed by completely different physics in different regimes, and no
single ODE may be able to account for this kind of behavior. In this case, the
models in the different regimes would be mutually exclusive, and PRET would be
unable to unify them into a single ODE, so it would simply return the list of
regimes, models, and transitions. This is exactly the form of a traditional
hybrid model of a multi-regime system.

Table 1. Valid models of the driven pendulum in different behavioral regimes.

Drive Frequency   ODE                                                 Description
None              $\ldots - \frac{g}{l}\sin\theta(t)$                 damped oscillator
Low               $\ddot{\theta}(t) = -\frac{g}{l}\sin\theta(t)$      nonlinear solution
Medium            $\ldots + \frac{g}{l}\sin\theta(t) = \sin at$       “true” (full) solution
High              $\ddot{\theta}(t) = -\frac{g}{l}\sin\theta(t)$      nonlinear solution
Very High         $\ddot{\theta}(t) = -\frac{g}{l}\theta(t)$          linear (small angle) solution

Footnote: The regime description quoted before this example is a paraphrase;
PRET’s syntax is much more cryptic, and it has no natural language capabilities.
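
The reconciliation step in the driven pendulum example can be illustrated with
a small sketch. It is a simplification under stated assumptions, not PRET's
method: PRET tests ODE models against observed behavior using its ODE theory,
whereas this sketch simply compares each model's predicted angular acceleration
with synthetic (angle, acceleration) pairs. The constant `G_OVER_L`, the
tolerance, and all function names are hypothetical.

```python
import math

G_OVER_L = 9.81  # hypothetical value of g/l (a pendulum of length 1 m)


def linear_model(theta):
    # Small-angle model: angular acceleration = -(g/l) * theta
    return -G_OVER_L * theta


def nonlinear_model(theta):
    # Larger-angle model: angular acceleration = -(g/l) * sin(theta)
    return -G_OVER_L * math.sin(theta)


def holds(model, observations, tol):
    """True if the model reproduces every (theta, acceleration) pair within tol."""
    return all(abs(model(theta) - acc) <= tol for theta, acc in observations)


def unify(models, regimes, tol):
    """Return the names of the models that hold in every regime."""
    return [
        name
        for name, model in models.items()
        if all(holds(model, obs, tol) for obs in regimes.values())
    ]


# Synthetic observations generated from the nonlinear pendulum (an assumption
# made so that the example is self-contained).
regimes = {
    "small-angle": [(t, nonlinear_model(t)) for t in (-0.05, 0.02, 0.05)],
    "limit-cycle": [(t, nonlinear_model(t)) for t in (-2.0, 1.0, 2.5)],
}
models = {"linear": linear_model, "nonlinear": nonlinear_model}

survivors = unify(models, regimes, tol=0.01)
print(survivors)
```

The linear model matches the small-angle observations but fails in the
limit-cycle regime, so only the nonlinear model survives, mirroring the
outcome described in the text. If `unify` returned an empty list, the
per-regime models would be kept separately, as in a hybrid model.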

As is true of automated modeling in general, evaluating the results of this
approach can be difficult because the question “How is this model better?” is
hard to formalize. From an engineering standpoint, a successful model is one that
matches observed behavior to within predefined specifications; PRET is designed
to be an engineer’s tool, so its judgment of what constitutes success or failure
is exactly that. Parsimony is another desirable attribute in a model: one wishes
to account for the observed behavior using as few, and as simple, ODE terms as
possible. Finally, the speed with which PRET produces such a model is another
important metric, particularly as we work with more-complex systems and search
spaces. Ultimately, the best test will be whether or not PRET’s models are
useful for control system design, that is, whether the ODE that PRET constructs
for a radio-controlled car can actually be used as the heart of a controller
designed to direct that car to perform some prescribed action. We are in the
process of evaluating models of real-world systems in several domains, ranging
from robotics to hydrology, in this manner.

5 Relationship to Related Work 

Most of the work in the AI/QR modeling community builds qualitative models
by combining a set of descriptions of state into higher-level abstractions or
qualitative states. Many tools also reason about equations at varying levels
of abstraction, from qualitative differential equations (QDEs) in QSIM
to ODEs in PRET. PRET’s approach differs from many of these tools in that it
works with noisy, incomplete sensor data from real-world systems, and attempts
not to “discover” the underlying physics, but rather to find the simplest ODE
that can account for the given observation. In the QR research that is most
closely related to PRET, ODE models are built by evaluating time series using
qualitative reasoning techniques and then using a parameter estimator to match
the resulting model with a given observed system. This modeling tool selects
models from a set of pre-enumerated solutions in a very specific domain (linear
visco-elastics). PRET is much more general; it works on linear and nonlinear
lumped-parameter continuous-time ODEs in a variety of domains and uses dynamic
model generation to handle arbitrary devices and connection topologies.

PRET shares goals and techniques with several other fields. It solves the same
problems as traditional system identification, but in an automated fashion,
and it relies upon many of the standard methods and ideas found in basic control
theory texts, such as controllability and reachability. Finally, PRET includes
many of the same concepts that appear in the data analysis literature, but it
adds a layer of AI techniques, such as symbolic data representation and logical
inference, on top of these.



6 Conclusion 

The goal of the work described in this paper is to automate the type of
input-output analysis that expert scientists and engineers apply to modeling
problems, and to use that technology to improve the PRET modeling tool, which
automatically constructs ODE models of nonlinear dynamical systems. The
challenges involved are significant; the nonlinear control-theoretic issues
involved in planning and executing experiments routinely stymie human experts.
First, PRET must autonomously manipulate a control parameter in order to analyze
the system and find behaviorally distinct regimes. Then, it must use knowledge
about the behavior and the regime boundaries to reason about what experiments
are useful and possible. Finally, PRET must use this information to perform the
experiments and analyze the results.

The qualitative state/parameter-space representation described in this paper
solves some of the problems that arise in phase-portrait analysis of complex
systems by combining a state/parameter-space representation with the qualitative
abstraction of cell dynamics. This QS/P-space representation, wherein a system’s
dynamics are classified into discrete regions of qualitatively identical
behavior, supports a set of reasoning tactics, collectively termed qualitative
bifurcation analysis, that allow PRET to reason about multiple sets of
observations over a given system.

PRET’s sensor-related reasoning is essentially complete, but its reasoning
about the relationship between models and excitation sources, as well as final
design decisions about how to treat actuator knowledge in an explicit way, is
still under development. PRET currently uses very little domain knowledge about
its target systems; instead, it relies upon general mathematics and physics:
principles that are broadly applicable and supported by a well-developed, highly
formalized body of mathematical knowledge that applies in any domain. The
point of this decision was to make PRET easily extensible to other domains;
because of this choice, refitting PRET for some new domain is simply a matter of
a few lines of Scheme code. However, as we extend PRET into more
network-oriented domains, such as electrical circuits, we are discovering that
effective use of domain theory may be critical to streamlining PRET’s generate
phase. A network-oriented modeling approach will also help PRET reason about
actuators in a more-intelligent fashion, as the actuator itself, with its
various non-ideal properties, may be represented directly as part of the
network. For example, a sinusoidal current source often has an associated
impedance that creates a loading effect on the rest of an electrical circuit.
With a network approach, these effects naturally become part of the model, just
as they do in real systems.

Undoing Statistical Advice

Abstract. Statistical advisors are small pieces of code that run in
response to user actions and provide statistical support for the user,
especially users who lack statistical sophistication. They could provide
advice in the form of a message, or could automatically perform supplemental
actions which change the model according to the advice. However, when the
original action is undone, the supplemental actions need to be undone as
well. This paper describes an implementation of an intelligent advisory
system, NAEPVUE, for analyzing data from the National Assessment of
Educational Progress (NAEP). NAEPVUE uses an object-oriented data dictionary
for storing statistical advisors. Built in the Amulet user interface
development environment, NAEPVUE is able to take advantage of the Myers and
Kosbie hierarchical undo model to undo the actions of statistical advisors.



1 Rationale 

Many diverse types of expertise are needed to analyze a large complex data set.
These include: (1) expertise in the scientific domain, (2) an understanding of how 
the data were collected and what the variables represent, (3) an understanding of 
the statistical methods used in the analysis and (4) an understanding of how the 
data are stored and how to extract a meaningful subset for analysis. It is rare to 
find all of those types of expertise embodied in one person. A statistical advisory 
system can help by allowing one person to transfer some of their expertise to 
the analyst. This is particularly true in the case of large government surveys, 
where the primary analysts can transfer expertise to secondary users through 
the advisory system. 

Consider the National Assessment of Educational Progress (NAEP), a
comprehensive, ongoing study of multiple aspects of educational achievement of
United States 4th, 8th and 11th grade students. The data for one grade level of
the 1996 survey contain 998 variables, including cognitive measurements on the
students and background variables on students, teachers, classrooms and schools,
taken for 121,000 individuals. To achieve high accuracy estimates of small
subpopulations, NAEP employs a complex multi-staged sampling design using both
stratification and clustering. To minimize the burden on individual students,
items are administered in balanced incomplete blocks which create complex
patterns of missing data; to compensate, the public use data tapes provide
multiple plausible values for the proficiency variables. To analyze NAEP data, a
researcher must be familiar not only with educational policy issues, but also
with multilevel models, weighted analysis and multiple imputation techniques,
not to mention the meaning of the NAEP background and scale variables.
Statistical advisory systems can help secondary analysts by supplying
just-in-time advice to supplement the analyst’s expertise.

NAEPVUE (Sect. 2) is a prototype advisory system for NAEP. Written in
Amulet (Myers et al., 1996), it provides a graphical front end to the process of
selecting a set of NAEP variables for analysis. Analysts specify models by
dragging icons representing variables around on the screen. The placement of the
variables on the screen indicates the intended role of the variable in the
analysis. NAEPVUE contains a complex data encyclopedia which records information
about the variables in a hierarchical taxonomy of variable types. Statistical
advisors can be attached to any level of this hierarchy.

Statistical advisors are small pieces of code which run in response to user 
actions which change the specifications of a model. They can provide advice 
about possible problems with the model, and recommend or automatically ap- 
ply changes in the model for statistical reasons. For example, an advisor might 
suggest a transformation, or automatically select the proper weights based on 
the response variable. Section 3 describes the NAEPVUE advisory system. 

Advisors which change the model present a problem if the user interface
supports an undo operation. When a change to the model is undone, the
corresponding advice must be undone as well. NAEPVUE can take advantage of the
Myers and Kosbie (1996) hierarchical command objects implemented in Amulet
to group the action of the advisors with the commands which triggered them.
This provides a natural mechanism for undoing the statistical advice. Section 4
describes this in greater detail. Section 5 outlines some alternative strategies.

2 NAEPVUE Overview 

In order to analyze the NAEP data, researchers must understand a variety of
technical issues detailed in the NAEP Technical Reports (O’Reilly et al., 1996).
This is a large volume, and finding the information can be a daunting task. The
NAEPEX program (Rogers, 1995) provides assistance in locating the data on the
CD-ROM distribution, but it only provides limited (40 character) descriptions of
the variables.

While trying to analyze and interpret the results of NAEP, the analyst ends up
focusing not on the educational policy issues, but on the statistical issues
presented by the NAEP data. Ideally, the focus of the analyst should be on the
scientific problem which led them to gather and analyze the data; statistical
issues should be a secondary concern.

An alternative suggested by Anglin and Oldford (1994) is to focus on the
model (and the data) as a vehicle for user interaction. NAEPVUE is one
realization of this approach. In NAEPVUE, analysts graphically specify their
model by dragging variable icons on a model specification dialog (Fig. 1). This
dialog is motivated by the graphical model representation of statistical models
(Whittaker, 1990). Section 2.1 describes the main features of the NAEPVUE
dialog.

Fig. 1. NAEPVUE Model Specifier Screen

NAEPVUE also employs a data encyclopedia to store metadata about the 
NAEP variables. This provides rapid (and searchable) access to information 
about the collection and interpretation of NAEP variables which previously re- 
quired going back to user manuals or questionnaires. The data encyclopedia is a 
hierarchical knowledge representation system which allows statistical properties 
and advice to be attached on any level of the hierarchy. Section 2.2 describes the 
variable type hierarchy; Section 3 describes the advice system. 



2.1 Model Specifier 

Figure 1 shows the NAEPVUE model specifier screen from the initial prototype. 
The central model display is the heart of the NAEPVUE system. It allows the 
user to specify which variables are to be included in the model and provides a 
“picture” of the current statistical model under consideration. By focusing on 
the model instead of the analysis procedure, NAEPVUE promotes this higher 
level thinking which is closer to the educational policy issues of the analysis. 

The model field is divided into four columns. Several of the columns are
subdivided into smaller areas. Each of these areas defines a role for the
variable. The roles are as follows:

1. Pool: Pool variables have not yet been assigned a role. When a variable is
   selected from the data encyclopedia or created through transformation, it is
   placed in the pool until the analyst specifies a role for it.

2. Explanatory: Explanatory or independent variables are divided into three
   subroles. Target variables are the focus of scientific interest, while
   control variables are included in the model to reduce variance. These
   variables are treated differently in graphical displays. Candidate variables
   are possible control variables used for model selection.

3. Response: Response variables are the target of measurement. Frequently, these
   will be the NAEP scores, although other derived scores could be used.

4. Weight: The weighting variable is filled in with the sampling weight
   appropriate for the given response variable.

5. Group: The group variable specifies subpopulations on which to perform
   parallel analyses, tabulating the results by group.

6. Cases: Certain cases (or individuals with certain properties) can be excluded
   from the model or included in the model by putting appropriate indicator
   variables in the cases area.

[Figure: Variables (Abstract Classes)]

Fig. 2. Top level of Variable Hierarchy. Figures 3 and 4 show lower levels.

Interactions among the variables can also be manipulated through the model 
display. An interaction is displayed with a star icon, linked to the constituent 
variables. As the NAEPVUE data dictionary keeps track of nesting relationships,
interactions which should be represented as nestings are automatically treated 
that way. 

Finally, the option buttons below provide the form of the model. In most 
cases, default values are computed by NAEPVUE. The analyst only needs to 
worry about these if they wish to override the default values. 



2.2 Variable Type Hierarchy 

The variable icons in the model display represent more than columns of numbers;
they point to rich objects containing both data and metadata. In NAEPVUE,
variables are represented as objects within an ontology (Gruber, 1991).
Important information about a variable, such as the survey question, which level
it was collected on, and the primary analyst’s notes about the variable, is
attached as properties of the variable. The variables are part of an object
hierarchy relationship, similar to the one described in Almond, Mislevy and
Steinberg (1997). This allows the primary analysts to attach “advisor” operators
to high-level abstract variable types; these operators are inherited by
instances of that type. For example, an advisor which suggests square root
transformations could be attached to the variable class “count” and would be
inherited by all variables representing counts. Hand (1993) suggests other
variable type based statistical advice and Roth et al. (1994) suggest selecting
visualizations based on the variable types. Mosteller and Tukey (1977, Chap. 5)
give some general advice on transformations based on the type of the variable.

Variable type hierarchies are not new; many programs can now take advantage
of the most basic of types, such as real or integer vs. ordered or unordered
factor. For example, both the New S program (Chambers and Hastie, 1992) and the
program JMP (SAS Institute) choose the model and fitting procedure based on the
type of the predictor variables and on the type of the response. This simple use
of functional polymorphism (dispatching the function on the types of the
arguments) reduces the program-specific knowledge needed by the user to operate
the program. The user learns a single syntax for the single fit model command
instead of separate commands for each model type; whether accessible via command
line or menu, this is a large reduction in user memory requirements.
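
As a rough illustration of dispatching a single fit command on the types of its
arguments, consider the sketch below. The table entries are common textbook
pairings chosen for illustration; they are assumptions, not the actual rules
used by S or JMP, and the names `FIT_PROCEDURES` and `fit_model` are
hypothetical.

```python
# Hypothetical dispatch table: (response type, predictor type) -> procedure.
# The pairings are illustrative and do not describe S or JMP.
FIT_PROCEDURES = {
    ("numeric", "numeric"): "linear regression",
    ("numeric", "factor"): "analysis of variance",
    ("factor", "numeric"): "logistic regression",
    ("factor", "factor"): "log-linear model",
}


def fit_model(response_type, predictor_type):
    """Select a fitting procedure from the types of the arguments."""
    try:
        return FIT_PROCEDURES[(response_type, predictor_type)]
    except KeyError:
        raise ValueError(
            f"no procedure for response={response_type!r}, "
            f"predictor={predictor_type!r}"
        ) from None


# The user issues the same command in both cases; only the types differ.
print(fit_model("numeric", "numeric"))
print(fit_model("numeric", "factor"))
```

The user only has to remember `fit_model`; the choice among procedures is made
from the variable types rather than from the command name.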

However, this simple dispatching only scratches the surface of what can be
done with a full variable hierarchy. Figure 2 shows the top level of the NAEPVUE
hierarchy. Figure 3 shows the NAEPVUE hierarchy for the factor variables.
Many NAEP background questions are similar, for example “How often do you
....” NAEPVUE creates abstract variable types for such similar questions, which
allows all related variables to share the same default attributes. Figure 4 shows
the details for numeric variables.



3 Statistical Advisors 

Statistical advice given by NAEPVUE needs to be sensitive to what has already 
been specified about the model. This is accomplished by a series of “Agents” — 
small pieces of code which are run after each user command. The agent can have 
one of two results: 

1. It can deliver a message to be displayed. For example, it could display a 
message informing the user that a teacher level variable is unsuitable for 
selection as a response. 

2. It can deliver a follow-on command object to be executed. For example, this 
mechanism can be used to select the appropriate model type and weights in 
response to the selection of a response variable. 

The list of agents appropriate to a situation is computed dynamically. In
particular, agents appropriate to a given situation are attached to variables.
When agent processing is triggered, NAEPVUE searches the variable type hierarchy
looking for all agents attached to this variable or its parents in the
hierarchy. Variables can also override agents specified higher in the hierarchy
by indicating that they should not be run.

Fig. 3. Variable Hierarchy for Factor Types. Variables lower in the hierarchy
inherit default values of attributes and statistical advisors from variables
higher in the hierarchy. Another part of the type hierarchy (not shown) contains
real and integer valued numeric variables.
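
The agent lookup described above (search the variable and its parents, while
letting lower levels switch off agents defined higher up) might be sketched as
follows. The class `VariableType`, the `suppressed` attribute, and the agent
names are hypothetical; the paper does not show how NAEPVUE represents
overrides.

```python
class VariableType:
    """A node in a variable type hierarchy (hypothetical representation)."""

    def __init__(self, name, parent=None, agents=(), suppressed=()):
        self.name = name
        self.parent = parent                # None for the root type
        self.agents = list(agents)          # agents attached at this level
        self.suppressed = set(suppressed)   # agents that must not be run


def collect_agents(node):
    """Gather agents from `node` up to the root, honoring suppressions.

    A suppression declared at some level applies to that level and to every
    level above it, so a variable can override advice it would inherit.
    """
    found, suppressed = [], set()
    while node is not None:
        suppressed |= node.suppressed
        for agent in node.agents:
            if agent not in suppressed and agent not in found:
                found.append(agent)
        node = node.parent
    return found


root = VariableType("variable", agents=["check_level"])
numeric = VariableType("numeric", parent=root, agents=["suggest_log"])
count = VariableType(
    "count", parent=numeric, agents=["suggest_sqrt"], suppressed=["suggest_log"]
)

print(collect_agents(count))
```

Here the "count" type inherits the level check from the root, adds its own
square root suggestion, and declines the logarithm suggestion attached to its
parent.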



3.1 Need for Advice 

Even an analyst with a strong statistical background can use assistance with the 
NAEP data. Several examples will illustrate the features. 

A whole class of issues revolves around tracking the level of analysis of the
variables. NAEP cognitive variables are collected at the student level and
aggregated at the school, state and region levels. NAEP background variables are
collected at the student, school and state level, as well as some non-sampling
levels such as teacher and classroom. Because teachers are not sampled with
equal weights, the teacher level responses are inappropriate for anything but
crude exploratory research. As the level of the variable is one of the attributes
tracked in the NAEPVUE data encyclopedia, NAEPVUE can do a lot of the
bookkeeping about the levels of the variables.

Fig. 4. Variable Hierarchy for Numeric Types

A closely related set of issues revolves around the required sampling weights.
In general, the proper set of weights will depend on both the level of the
response variable (student, school or state) and what the target population of
inference will be (i.e., whether poststratification is applied). Here, preparing
a menu of weights appropriate for each level, as well as tracking which
jackknife replicates are appropriate for variance calculations, is a big help.



3.2 Advisor Feedback 

For advice to be truly useful, the analyst must know not only what was done, but
why. For this reason, all of the advisors have a common user feedback mechanism.
At the bottom of the main screen is a small area for displaying alert messages. A
message consists of three parts: (1) a “traffic light” which indicates its
severity, (2) the text of the message, and (3) an optional help reference which
provides a link to on-line information about the nature of the problem or advice
offered by the system. The messages are stored in a queue so that the user can
scroll backwards through previous messages.

Both immediate commands and advisory agents return messages. The message
color indicates how the message will be displayed and whether or not to abort
the user-selected action. Normally, red light messages abort the user action,
while yellow light messages give a warning. The user can configure the system to
allow more or less freedom to take actions contrary to the system’s advice.
A special “infrared” color (system errors) always aborts, and a “clear” color is
used for transient status messages.
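
The feedback mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names Severity, Message and AlertArea, the numeric ordering of the colors, and the configurable abort_threshold are hypothetical and are not taken from NAEPVUE or Amulet.

```python
from collections import deque
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Traffic-light colors; the numeric order is an assumption of this sketch."""
    CLEAR = 0      # transient status message
    YELLOW = 1     # warning
    RED = 2        # normally aborts the user action
    INFRARED = 3   # system error, always aborts


@dataclass
class Message:
    severity: Severity
    text: str
    help_ref: Optional[str] = None  # optional link to on-line help


class AlertArea:
    """Keeps a queue of messages and decides whether an action should abort."""

    def __init__(self, abort_threshold=Severity.RED, history=100):
        self.abort_threshold = abort_threshold  # user-configurable strictness
        self.queue = deque(maxlen=history)

    def post(self, message):
        """Record the message; return True if the user action should abort."""
        self.queue.append(message)
        if message.severity == Severity.INFRARED:
            return True          # system errors always abort
        if message.severity == Severity.CLEAR:
            return False         # status messages never abort
        return message.severity >= self.abort_threshold

    def previous(self, n=1):
        """Return the last n messages, most recent first (scrolling back)."""
        return list(self.queue)[-n:][::-1]
```

Setting abort_threshold to Severity.YELLOW would make warnings abort as well, while Severity.INFRARED would let the user override red messages; both settings correspond to the "more or less freedom" mentioned in the text.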

3.3 Advisors and Type Hierarchy 

The advisors depend on the metadata in the data encyclopedia in a number of 
different ways. In particular, advisors can use attributes of the variables stored 
in the encyclopedia, or they themselves can be attributes of the variables. In 
either case, they can take advantage of inheritance among the variable types to 
make the task of specifying advice simpler. 

For example, consider the advisor which selects weights for the analysis. This
advisor operates primarily on the level attribute of the response variable. Thus
when the user specifies a response variable, the weighting advisor selects an
appropriate set of weights or, if the selected response variable is at the
teacher or classroom level, issues a warning that no appropriate weights are
available. As the level attribute is copied when a variable is transformed, the
weight selection advisor can select an appropriate weight for transformed
variables as well.

Any time NAEPVUE performs any action for which there may be advisors, it 
walks up the variable type hierarchy looking for advisors which are appropriate 
to the context. Consider the problem of creating a context sensitive menu of can- 
didate transformations for a variable. Possible transformations can be attached 
at any level of the variable hierarchy. When building the list of candidates, the 
transformation advisor walks up that hierarchy, gathering candidate transfor- 
mations. 
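
A minimal sketch of this walk up the type hierarchy is given below. The class VariableType, the type names and the transformation names are hypothetical examples chosen for illustration; they are not the types of Fig. 4 or the NAEPVUE implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VariableType:
    """One node of a (hypothetical) variable type hierarchy."""
    name: str
    parent: Optional["VariableType"] = None
    transformations: List[str] = field(default_factory=list)


def candidate_transformations(var_type):
    """Walk from the given type up to the root, gathering transformations.

    More specific types are visited first, so their candidates come first
    in the resulting menu; duplicates are listed only once.
    """
    candidates = []
    node = var_type
    while node is not None:
        for transformation in node.transformations:
            if transformation not in candidates:
                candidates.append(transformation)
        node = node.parent
    return candidates


# Hypothetical three-level hierarchy.
numeric = VariableType("Numeric", transformations=["standardize"])
positive = VariableType("Positive", parent=numeric,
                        transformations=["log", "square root"])
count = VariableType("Count", parent=positive,
                     transformations=["collapse categories"])

menu = candidate_transformations(count)
```

Because a candidate attached to a parent type is found automatically for all of its descendants, advice needs to be specified only once at the most general level where it applies.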



4 Undo 

In the Amulet user interface environment, user manipulations of the interface 
produce command objects. A command object contains three “Methods” or 
pieces of code which are run on demand. The “Do Method” performs the action 
requested by the user. The “Undo Method” reverses the effect of the action and 
the “Redo Method” repeats the action. When the user performs some gesture 
which triggers a command - for example, pressing a button or selecting a menu 
item - the appropriate command object is created, filled in with details about 
the current context and its “Do Method” is activated. If the command processing 
does not abort, the command object is then placed in the undo queue so it is 
available for later undo or redo operations. 

However, statistical advisors complicate this simple model. After the action
of the command is run, the advisors run. These may in turn make changes to
the model, for example, automatically adding the appropriate weights when a
response variable is selected. If the action triggering the advisor is undone, the
advice commands must be undone as well.

Often commands appear in a nested series which in Amulet is called the
implementation hierarchy (Myers and Kosbie, 1996). For example, consider the
operation of moving a variable on the model display. If the variable moves into a
new region, this causes the variable to change roles. In NAEPVUE, the “Change
Role” command object is the implementation parent of the “Move Node” command
object. The highest object in the implementation hierarchy is the one
queued for undo (or one can use a special top level object to prevent the
command from being queued for undo). On an undo, all of the commands in the
hierarchy are undone (so in the example both the “Change Role” and the “Move
Node” commands would be undone).

In NAEPVUE advisory agent invocation is an implementation parent of the
active command. Commands which should invoke the advisory agents are given
a special “Invoke Agents” command as their implementation parent. The “Do
Method” for this command has the following steps:

1. Build a list of agents to invoke based on the variables currently specified in
the model.

2. Execute those agents one at a time.

(a) If the agent returns a command object, invoke the do method of the
command object and put the command into a queue of actions taken
with this command.

(b) If the agent returns a message object, display that message in the alert
area. If the severity of the message is sufficiently strong, abort the
command. In this case, that means invoking the undo methods of all of the
agent commands as well as the undo mechanism of the command which
triggered the “Invoke Agents” command.

The undo mechanism of the “Invoke Agents” command simply triggers the 
undo method for each command in the queue of agent actions recorded in the 
original invocation. The redo mechanism runs their redo actions. 
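
The do, undo and redo behaviour of the “Invoke Agents” command can be sketched as follows. This is a simplified illustration, not Amulet code: the classes Command, Message, SetValue and InvokeAgents are hypothetical, the model is a plain dictionary, the list of agents is passed in rather than built from the model (step 1), and the command holds a direct reference to the triggering command instead of relying on an implementation hierarchy.

```python
_MISSING = object()  # sentinel: the key was absent before the command ran


class Command:
    """Hypothetical stand-in for an Amulet-style command object."""

    def do(self):
        raise NotImplementedError

    def undo(self):
        raise NotImplementedError

    def redo(self):
        self.do()


class Message:
    """Advisor feedback; abort marks a severity strong enough to cancel."""

    def __init__(self, text, abort=False):
        self.text = text
        self.abort = abort


class SetValue(Command):
    """Sets one entry of the model and caches the old value for undo."""

    def __init__(self, model, key, value):
        self.model, self.key, self.value = model, key, value
        self.previous = _MISSING

    def do(self):
        self.previous = self.model.get(self.key, _MISSING)
        self.model[self.key] = self.value

    def undo(self):
        if self.previous is _MISSING:
            self.model.pop(self.key, None)
        else:
            self.model[self.key] = self.previous


class InvokeAgents(Command):
    def __init__(self, model, trigger, agents, alerts):
        self.model, self.trigger = model, trigger
        self.agents, self.alerts = agents, alerts
        self.actions = []  # queue of agent commands taken with this command

    def do(self):
        self.trigger.do()
        self.actions = []
        for agent in self.agents:              # step 2: one agent at a time
            result = agent(self.model)
            if isinstance(result, Command):    # step 2(a)
                result.do()
                self.actions.append(result)
            elif isinstance(result, Message):  # step 2(b)
                self.alerts.append(result.text)
                if result.abort:
                    self.undo()
                    self.actions = []
                    return False
        return True

    def undo(self):
        # Undoing in reverse order is a choice of this sketch.
        for action in reversed(self.actions):
            action.undo()
        self.trigger.undo()

    def redo(self):
        self.trigger.redo()
        for action in self.actions:
            action.redo()
```

Because each agent command caches the value it overwrote, undoing the “Invoke Agents” command restores the model without re-running the agents.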

Example 1. Suppose that the user selects the ‘Data Analysis Scale Score’
variable and moves it from the “Pool” to the “Response” area on the screen. This
triggers the parent action which sets the role of the variable to ‘response’; the
parent action of the change role action triggers the advisors. ‘Data Analysis
Scale Score’ is a continuous student level variable which has multiple plausible
values. Thus three advisors will run after the selection: Advisor A selects a
linear response model, Advisor B selects the student level weights, and Advisor C
configures the model for multiple imputations (averaging over multiple runs of
the model).

Example 2. (undo!) If the user later undoes the selection of ‘Data Analysis Scale
Score’ as a response variable, the following actions take place: (a) Advisor A is
undone and the model type is changed back to its previous value (cached with
the advisor’s command object), (b) Advisor B is undone and the student level
weights are removed, (c) Advisor C is undone and the multiple imputation flag
is cleared, (d) the selection of ‘Data Analysis Scale Score’ as a response variable
is undone and its role is restored to ‘Pool’, and (e) the movement of the
variable’s icon is undone and it is returned to its original location in the
‘Pool’ area.

Example 3. Suppose the user attempts to move the variable “Teacher’s
Education” into the response area. Suppose further that the system is set up to
query the user on warnings. First, the system would move the “Teacher’s Education”
icon into the response area. Second, the system would set the role of “Teacher’s
Education” to “response.” Third, the advisory agents would be run. Agent A
would set the type of the model to generalized linear model. Agent B would try
to find weights, but would discover that there are no appropriate weights for
teacher level variables, and would issue a warning. As the user has selected query
on warning, she would be offered the possibility of cancelling the action. If she
selects cancel, then Agent A would be undone, as would the “Change Role” and
“Move Icon” actions.

5 NAEPVUE Experience 

One advantage of working in Amulet (Myers et al., 1996) was the command
object and undo mechanism. This forced me to consider the issues raised by undo
at an early stage. (In my personal experience, retrofitting an undo mechanism
onto existing operational software can be very expensive.)

In this approach, the statistical advisors essentially log their undo
information with the command object which triggered them. Thus, the advice can be
undone when the base command is. An alternative strategy is to always call the
advisors based on the current model. Thus after an undo, new advisors would
respond to the current state of the model, undoing the effect of previous
advisors. Both approaches are feasible. As NAEPVUE is still in a prototype stage,
it is difficult to judge the success of this approach.

Abstract. In this paper we present a new method for temporal knowledge
conversion, called TCon. The main aim of our approach is to convert
temporal complex patterns in multivariate time series into a linguistic
description of the patterns that is understandable for human beings.
The main idea for the detection of those complex patterns lies in
breaking down a highly structured and complex problem into several
subtasks. Therefore, several abstraction levels have been introduced,
where at each level temporal complex patterns are detected successively
using exploratory methods, namely unsupervised neural networks together
with special visualization techniques. At each level, temporal
grammatical rules are extracted. The method TCon was applied to a
problem from medicine, sleep apnea. It is a hard problem since quite
different patterns may occur, even for the same patient, and the
duration of each pattern may differ strongly. Altogether, all patterns
have been detected and a meaningful description of the patterns was
generated. Even some kind of "new" knowledge was found.



1 Introduction 

In recent years there has been an increasing development towards more powerful
computers, such that nowadays a great amount of data from, for example,
industrial processes or medical applications is gathered. These measured data
are often said to be a starting point for an enhanced diagnosis or control of the
underlying process. Artificial neural networks (ANN) are particularly interesting
for handling noisy or inconsistent data. On the other hand, systems with
traditional artificial intelligence (AI) technologies have been successful in
areas like diagnosis, control and planning. The advantages of both technologies
are wide-ranging. However, the limits of these approaches, namely the incapacity
of ANN to explain their behaviour and, on the other hand, the acquisition of
knowledge for AI systems, are important problems to be addressed.

Recently, there has been an increased interest in hybrid systems that integrate
AI technologies and ANN to solve this kind of problem 0. It is worth remarking
here that mainly hybrid systems have been developed that comprise several
modules, each implemented in a different technology, which cooperate with one
another. In contrast, we are mainly interested in hybrid systems that perform a
knowledge conversion, i.e. a transition between distinct knowledge
representation forms eg. A symbolic knowledge representation of a subject should
always be in a linguistic form that is understandable for human beings. Examples
of linguistic representation forms are natural languages, such as German or
English, but also predicate logic, mathematical calculus, etc. In contrast, a
subsymbolic knowledge representation always entails numerous elements, for
example data points from a time series or neurons and weights in ANN, that
cooperate in a shared and distributed representation of a symbol.

D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 369-12^01 1999-
© Springer-Verlag Berlin Heidelberg 1999

Previous approaches that realize a knowledge conversion cni, cu, eg, eni
do not consider data with temporal dependencies. Temporal knowledge conversion
always assumes the existence of temporal data, i.e. time series sampled
from signals that describe some process. All sampled values together form a
temporal subsymbolic knowledge representation of the time series. A temporal
knowledge conversion is a, possibly successive, conversion of multivariate time
series or temporal complex patterns in time series to a linguistic
representation of the time series that is understandable for human beings, i.e.
a temporal symbolic knowledge representation P).

In this paper, we introduce a new method that enables a temporal knowledge
conversion, called TCon P]. In order to handle this complex problem, several
abstraction levels have been introduced. We applied our method TCon to
sleep apnea, namely sleep-related breathing disorders (SRBD). SRBD are regarded
as a very hard problem since quite different patterns may occur for the same
temporal pattern, even for the same patient, and the duration of each temporal
pattern may differ strongly as well HH. H21.

2 A Method for Temporal Knowledge Conversion 

The method TCon enables a conversion from temporal complex patterns (TCP)
in multivariate time series to a linguistic temporal symbolic representation,
understandable for human beings, in the form of temporal grammatical rules (see
Fig. 1). The main idea for the detection of TCP in multivariate time series lies
in breaking down a highly structured and complex problem into several
subproblems. The advantage of such a strategy is that the highly complex problem
is resolved into several subtasks that are solvable at a more technical level.
Therefore, several abstraction levels have been introduced, where at each level
TCP are detected successively using exploratory methods, namely unsupervised
neural networks 0. The detection process starts with the identification of
primitive patterns, i.e. elementary structures in time series. At the following
levels, the time dimension is introduced gradually into the detection process
until the identification of TCP at the last abstraction level is completed.

At the different abstraction levels temporal grammatical rules are generated
for a linguistic description of all TCP. The advantage of a temporal symbolic
knowledge representation in the form of temporal grammatical rules is not only
that it yields a representation of the TCP appropriate for human beings, but
also that it is a knowledge representation form that can be processed by a
machine, for example by a Prolog interpreter. In order to achieve both the
detection of TCP and their description at a symbolic level, we suggest a temporal
knowledge conversion. Next, we will introduce the different abstraction levels of
TCon and give an overview of the tasks.

Fig. 1. Abstraction levels and steps of the method TCon in P]

Multivariate time series gathered from observed signals of complex processes,
as they occur in industry or in medicine, are the input of TCon. We
generate a multivariate time series by sampling the observed values at equal
time intervals. The results of the method TCon are the detected TCP as well as
a grammatical description of the TCP at different abstraction levels.

For example, consider a patient with sleep apnea, namely sleep-related
breathing disorders (SRBD), where different types of signals concerning
respiratory flow, i.e. 'airflow', and respiratory effort are registered during
one night m. The respiratory effort comprises 'chest wall and abdominal wall
movements'. Furthermore, 'snoring' as well as 'oxygen saturation' are considered
for the identification of SRBD. Fig. 2 shows such a registration for a short
time period. All time series are sampled at 25 Hz. In this paper, we use this
example from medicine to illustrate our method.



2.1 Feature Extraction and Preprocessing 

First, an extraction of the main features for all time series is advisable, or
even a prerequisite for further processing. Therefore, methods from, for
instance, statistics or signal processing are applied to the time series in order
to find a suitable representation. This process usually includes a preprocessing
of the time series such that a clustering with unsupervised neural networks
becomes possible. For most practical applications the choice of an adequate
preprocessing will be one of the most significant factors in determining the
final performance of the system j2j. An improvement of the overall performance
may be achieved by incorporating prior knowledge, which might be used for the
extraction of the features. For each application the feature extraction process
may differ strongly. Therefore, we will not focus on this issue in this paper; a
detailed description of the feature extraction is given elsewhere. Nevertheless,
it is worth mentioning that we considered criteria from the application that are
usually applied in sleep laboratories for the identification of SRBD [I I ).

Fig. 2. Small excerpt of multivariate time series and respective features from
a patient with SRBD

As multimodal distributions occurred in the data for each of the time series
'airflow', 'chest wall movements' and 'abdominal wall movements', fuzzy
membership functions for 'no', 'reduced' and 'strong' averaged amplitude changes
have been deduced from histograms. Additionally, lags between 'chest wall
movements' and 'abdomen wall movements' may occur that have a high significance
for the identification of the SRBD. Therefore, cross-correlations between 'chest
and abdomen wall movements' have been calculated. Besides, a rescaling of
'snoring' was performed. As oxygen saturation is not relevant for the pattern
detection process, we only consider the occurrence of a drop of at least 4% in
the oxygen saturation during the past 10 sec. Altogether, twelve features have
been extracted (see Fig. 2): 'strong airflow' ∈ [0, 1], 'reduced airflow' ∈
[0, 1], 'no airflow' ∈ [0, 1], 'strong chest wall movements' ∈ [0, 1], 'reduced
chest wall movements' ∈ [0, 1], 'no chest wall movements' ∈ [0, 1], 'strong
abdomen wall movements' ∈ [0, 1], 'reduced abdomen wall movements' ∈ [0, 1],
'no abdomen wall movements' ∈ [0, 1], 'lag of chest and abdomen wall movements'
∈ [-1, 1], 'snoring intensity' ∈ [0, 1] and 'oxygen desaturation' ∈ {0, 0.5, 1}.
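
As an illustration of one of these features, the following sketch flags samples at which the oxygen saturation has dropped by at least 4 within the past 10 seconds. It rests on several assumptions that are not stated in the text: the drop is measured in percentage points against the maximum of the preceding window, the result is a plain boolean rather than the three-valued coding {0, 0.5, 1}, and the function name desaturation_flags is hypothetical.

```python
def desaturation_flags(saturation, sample_rate_hz=25, window_sec=10, min_drop=4.0):
    """Flag samples where saturation fell by >= min_drop within the window.

    saturation: list of oxygen saturation values in percent, equally sampled.
    Returns a list of booleans of the same length.
    """
    window = int(sample_rate_hz * window_sec)  # number of samples to look back
    flags = []
    for i, value in enumerate(saturation):
        start = max(0, i - window)
        recent_max = max(saturation[start:i + 1])
        flags.append(recent_max - value >= min_drop)
    return flags
```

With the sampling rate of 25 Hz mentioned above, the default window covers 250 samples.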



2.2 Primitive Patterns 

At this level, primitive patterns, i.e. elementary structures in the time series,
are determined from the extracted features. For this purpose, we propose to use
exploratory methods, in particular self-organized neural networks (SONN) as
proposed by Kohonen (S]. In recent years, SONN enhanced with a special
visualization technique, called U-Matrix CHI, have been successfully applied to
a wide range of applications where a clustering of high-dimensional data was
required IHI, □, CHI, CH, CHI, m-.

Fig. 3. Multivariate time series and respective primitive patterns/successions
from a patient with SRBD

For the detection of primitive patterns several features have to be selected in
order to start the learning process. This means that we have to identify those
features that have a lot in common with regard to criteria from the application.
It also means that several SONN will be trained to detect primitive
patterns from different feature selections. We emphasize that one feature may
appear in different feature selections. After the learning process and the
identification of the clusters using U-Matrices, we are able to determine the
primitive patterns. Regions may appear on the U-Matrix that do not correspond
to a specific cluster. These regions are regarded as a kind of interruption,
named tacets. All other regions are associated with a primitive pattern.
Each example is associated with exactly one class, either a primitive pattern
class or the tacet class. As a consequence, we are now able to classify the whole
time series with primitive patterns and tacets (see Fig. 3). The successions of
primitive patterns from one U-Matrix are called a primitive pattern channel.

Without a proper interpretation of the detected structures no meaningful
names can be given to the primitive patterns. As a consequence, we could not
generate temporal grammatical rules at the next higher levels that are
understandable for human beings. In order to achieve a meaningful description of
the primitive patterns, we propose to use machine learning algorithms. For the
first time, the rule generation algorithm called sig* ini was used to generate
rules for data in a temporal context. The sig* algorithm selects significant
attributes for each class, in this case significant features for each primitive
pattern, in order to construct appropriate conditions that characterize each
class, and it also generates differentiating rules that distinguish classes from
each other. It takes a classified data set in a high-dimensional space as input
and produces descriptions of the classes in the form of decision rules. In
particular, the generated rules take the significance of the different
structural properties of the classes into account. If only a few properties
account for most of the cases of a class, the rules are kept very simple.
There are two main problems addressed by the sig* algorithm [ni.

First, we have to decide which attributes of the data are significant in order 
to characterize each class. Therefore, each attribute of a class is associated with 
a ’’significance value” that can be obtained, for example, by means of statistical 
measures. In order to define the most significant attributes for the description of 
a class, the significance values of the attributes are normalized in percentage of 
the total sum of significance values of a class and sorted in a decreasing order. 
The attributes with the largest significance value in the ordered sequence are 
taken until the cumulative percentage equals or exceeds a given threshold value. 
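
The selection step just described can be sketched as follows. The function name select_significant, the example significance values and the default threshold of 80% are assumptions made for illustration; how the significance values themselves are computed is left open, as in the text.

```python
def select_significant(significance, threshold=80.0):
    """Pick the most significant attributes of one class.

    significance: dict mapping attribute name -> non-negative significance value.
    threshold: cumulative percentage at which the selection stops.
    """
    total = sum(significance.values())
    if total <= 0:
        return []
    # Sort attributes by significance in decreasing order.
    ranked = sorted(significance.items(), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for name, value in ranked:
        selected.append(name)
        cumulative += 100.0 * value / total  # normalize to percent of the total
        if cumulative >= threshold:
            break
    return selected


# Hypothetical significance values for one primitive pattern class.
example = {"no airflow": 6.0, "reduced airflow": 3.0, "snoring intensity": 1.0}
chosen = select_significant(example, threshold=80.0)
```

A lower threshold yields shorter rules, which matches the remark that rules stay simple when a few properties account for most of a class.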

Second, we have to formulate suitable conditions for each selected significant
attribute. For this problem we can use the distribution properties of the
attributes of a class. Assuming a normal distribution for a certain attribute,
about 95% of the attribute values lie within the limits [mean - 2*dev,
mean + 2*dev], where dev is the standard deviation of the attribute.
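
A short sketch of how such a condition could be derived from the values of one attribute within a class is given below. The function name interval_condition, the use of the population standard deviation, and the optional clipping to the valid range of the feature (for example [0, 1] for the fuzzy features) are assumptions of this sketch.

```python
from statistics import mean, pstdev


def interval_condition(values, lower=None, upper=None, k=2.0):
    """Return the interval [mean - k*dev, mean + k*dev] for one attribute.

    values: attribute values of the examples belonging to one class.
    lower, upper: optional bounds of the attribute's valid range.
    """
    m = mean(values)
    dev = pstdev(values)  # population standard deviation (an assumption)
    low, high = m - k * dev, m + k * dev
    if lower is not None:
        low = max(low, lower)
    if upper is not None:
        high = min(high, upper)
    return low, high


# Hypothetical values of 'no airflow' for the examples of one class.
low, high = interval_condition([0.9, 1.0, 1.0, 0.9], lower=0.0, upper=1.0)
```

The resulting pair can be read as a rule condition of the form "attribute in [low, high]".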

Until now, we have described only the part of the sig* algorithm that produces
characterizing rules. If the intersection of two classes is nonempty, an
additional description of the intersection between the two overlapping classes is
necessary. Therefore, we add to the characterizing rule of each class a condition
that will be tested by another rule, called a differentiating rule. These rules
are generated in analogy to the characterizing rules. The significance values,
however, are measured between the two classes under consideration.

As our main aim is the generation of a meaningful description of the primitive
patterns using sig* rules, we considered only characterizing rules. Of course,
both characterizing and differentiating rules have been generated. Example 1
shows sig* rules for two primitive patterns from different feature selections. We
emphasize that the complexity of the characterizing rules, i.e. the number of
features characterizing a primitive pattern, differs a lot, since the
significance of each feature differs for each primitive pattern. In our case, the
conditions for each selected significant feature have a meaning related to the
occurrence of the feature for the respective primitive pattern. Values near zero
mean that this feature will probably not occur, since zero means "no occurrence".
Values near one mean that this feature will occur with a high probability and,
therefore, will be used for the generation of a name for a given primitive
pattern. This is due to the feature extraction process, where mainly fuzzy
functions have been used. For details see 0.

Example 1. Consider the primitive patterns 'A2' and 'B3' that have been
detected from different U-Matrices. The following sig* rules have been generated:

A primitive pattern is a 'A2'
if
    'no airflow' in [0.951, 1]
and
    'reduced airflow' = 0
and
    'snoring intensity' in [0, 0.241]

A primitive pattern is a 'B3'
if
    'no chest wall movements' in [0.772, 1]
and
    'no abdomen wall movements' in [0.641, 1]
and
    'reduced chest wall movements' = 0
and
    'snoring intensity' = 0



Values near one mean that this feature occurs with a high probability, while
values near zero mean that this feature probably will not occur.

As the sig* algorithm generates rules with the most significant features for
each primitive pattern, the naming of the primitive patterns is straightforward.
The primitive pattern 'A2' was named 'no airflow without snoring' and 'B3'
was named 'no chest and abdomen wall movements without snoring'. The names
of the primitive patterns have been generated semi-automatically, as they
correspond directly to the automatically generated sig* rules. We will see later
that they are of crucial importance for the generation of meaningful grammatical
rules at the next higher levels. Only now are we able to generate meaningful
rules that are understandable for human beings such as domain experts.



2.3 Successions 

At this level, we introduce the time dimension: succeeding identical primitive
patterns are regarded as a succession (see Fig. 3). Each succession has a
corresponding primitive pattern type. The main difference lies in the fact that a
succession additionally has a start and an end point and, consequently, a
duration. Successions may be identified by trajectories visualized on U-Matrices.
We will not focus on this issue in this paper.
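
Given a primitive pattern channel as a list of labels, one per sample, successions can be obtained by grouping runs of identical labels. The sketch below is a simple run-length grouping and is not the trajectory-based identification on U-Matrices mentioned above; the function name successions and the tuple layout (label, start, end) are assumptions.

```python
from itertools import groupby


def successions(labels, sample_rate_hz=25.0):
    """Group runs of identical primitive pattern labels into successions.

    labels: one primitive pattern (or tacet) label per sample.
    Returns a list of (label, start_sec, end_sec) tuples.
    """
    result = []
    index = 0
    for label, run in groupby(labels):
        length = len(list(run))
        start = index / sample_rate_hz
        end = (index + length) / sample_rate_hz
        result.append((label, start, end))
        index += length
    return result


channel = ["A2", "A2", "tacet", "B3", "B3", "B3"]
runs = successions(channel, sample_rate_hz=1.0)
```

The duration of a succession is simply the difference between its end and start points.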

A consequence of several feature selections is that several SONN will be
trained and, therefore, several U-Matrices will be generated. This means that
two or more successions may occur more or less simultaneously. Two overlapping
successions are said to occur more or less simultaneously if and only if the
deviations between their start points and between their end points are very small.
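
This criterion can be written down directly. In the sketch below a succession is a tuple (label, start_sec, end_sec); the function name and the tolerance of one second are hypothetical, since the text does not quantify "very small".

```python
def more_or_less_simultaneous(a, b, tolerance_sec=1.0):
    """Check whether two successions overlap and nearly share start and end.

    a, b: tuples (label, start_sec, end_sec) from different channels.
    tolerance_sec: largest accepted deviation of start and of end points.
    """
    _, a_start, a_end = a
    _, b_start, b_end = b
    overlap = min(a_end, b_end) - max(a_start, b_start)
    return (overlap > 0
            and abs(a_start - b_start) <= tolerance_sec
            and abs(a_end - b_end) <= tolerance_sec)
```

For example, ("A2", 10.0, 25.0) and ("B3", 10.4, 24.5) satisfy the criterion with the default tolerance, whereas ("A2", 10.0, 25.0) and ("B3", 14.0, 25.0) do not.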

2.4 Events

At this level, more or less simultaneous successions are joined together into a
new unit, called an event. In order to focus on the most significant events, we
distinguish between events that occur very frequently and those that occur less
frequently. Rare events are omitted in the sense that they are regarded as
interruptions. These are named event tacets. The idea is to select the most
frequent events as the most significant events. Then, less frequent events can be
associated with them. Similarities among the successions have been considered to
join different types of more or less simultaneous successions, i.e. frequent and
less frequent events. This means that the number of events is greatly reduced and
that one event consists of different types of more or less simultaneous
successions, i.e. frequent and less frequent events, as well.

As a consequence, at this abstraction level temporal grammatical rules entail
not only a "more or less simultaneous" but also an "or" for the description of
alternations between more or less frequent events. Let us consider the example of
the patient with SRBD. In this case, three events have been detected (see Fig. 4).
Names of events can be derived straightforwardly from the generated grammatical
rules, as the names of the primitive patterns, i.e. successions, are already
known (see Example 2). For a detailed description of the whole detection process
and the generation of the grammatical rules for the events see 0.

Example 2. The following grammatical rules have been generated for 'Event1'
and 'Event3':

An event is a 'Event1'
if
    'no airflow without snoring'
is more or less simultaneous
    ('no chest and abdomen wall movements without snoring'
    and
    'tacets')

An event is a 'Event3'
if
    ('strong airflow with snoring'
    and
    'reduced airflow with snoring'
    and
    'tacets')
is more or less simultaneous
    'strong chest and abdomen wall movements'

The occurrence of tacets in the rules means that small interruptions in
successions may occur or that a succession from one primitive pattern channel,
for example, occurs simultaneously with irrelevant information in the other
channel. The name of an event essentially contains the names of the most
frequently occurring successions. Names of rarely occurring successions may be
dismissed, since the idea of temporal knowledge conversion also entails an
information reduction for the generation of well-understandable rules. If needed,
details may then be consulted at lower abstraction levels. The following names
have been derived from the rules:

- 'Event1': 'no airflow and no chest and abdomen wall movements without
snoring'

- 'Event2': 'no airflow and reduced chest wall movements and no abdomen wall
movements without snoring'

- 'Event3': 'strong breathing with snoring'

Fig. 4. Multivariate time series and respective events from a patient with SRBD



2.5 Sequences 

At a symbolic representation level, an event may be interpreted as a symbol in a
temporal context that cannot be further decomposed. Then, at this abstraction
level a multivariate time series can be represented as a sequence of symbols,
i.e. events. In order to detect TCP in multivariate time series, we
just have to identify repeated subsequences of events. The main problem lies in
the identification of start and end events, in particular when dealing with time
series that entail several distinct TCP. Therefore, we built a probabilistic
automaton and considered delays between the occurrence of two different
events (see P]).
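
The construction of the automaton is not detailed here. As a rough illustration only, the following sketch estimates first-order transition probabilities between successive events, which is one simple way to see which events tend to follow each other; the function name and the example event stream are hypothetical, and delays between events are ignored.

```python
from collections import Counter, defaultdict


def transition_probabilities(events):
    """Estimate P(next event | current event) from a sequence of event labels."""
    counts = defaultdict(Counter)
    for current, following in zip(events, events[1:]):
        counts[current][following] += 1
    probabilities = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probabilities[current] = {
            following: count / total for following, count in followers.items()
        }
    return probabilities


# Hypothetical event stream of one recording.
stream = ["Event1", "Event2", "Event3",
          "Event1", "Event2", "Event3",
          "Event1", "Event2", "tacet"]
probs = transition_probabilities(stream)
```

In such a table, transitions with high probability hint at subsequences that repeat, while an event that is rarely preceded by another frequent event is a candidate start event.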

A sequence of events together with the multivariate time series from the
patient with SRBD is illustrated in Fig. 4. For this SRBD we identified the
following sequence (see Fig. 5), where 'Event2' follows immediately after
'Event1' and 'Event3' follows 'Event2' after a small interruption.

Fig. 5. A detected sequence from the patient with SRBD



For the generation of the grammatical rules we introduced at this level a
"followed by" and a "followed after 'interval' by". As a sequence always occurs
more than once in a multivariate time series, lower and upper boundaries for
the duration of the events and sequences may be specified (see Example 3).

Example 3. The following grammatical rule has been generated for 'Sequence1':

A sequence is a 'Sequence1' in [40 sec, 64 sec]
if
    'Event1': 'no airflow and no chest and abdomen wall
    movements without snoring' in [13 sec, 18 sec]
followed by
    'Event2': 'no airflow and reduced chest and no abdomen
    wall movements without snoring' in [20 sec, 39 sec]
followed after [0.5 sec, 5 sec] by
    'Event3': 'strong breathing with snoring'
    in [6 sec, 12 sec]
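
Since such a rule can be processed by a machine, a small sketch of how it might be checked against detected events is given here. The data structure Step, the function matches, and the reading of a plain "followed by" as a pause of at most 0.5 sec are assumptions of this sketch; the durations are those of Example 3.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Step:
    event: str
    duration: Tuple[float, float]          # allowed duration in seconds
    gap: Tuple[float, float] = (0.0, 0.5)  # allowed pause before this step


def matches(rule, total, occurrences):
    """Check detected events against a sequence rule.

    rule: list of Step objects in temporal order.
    total: (min, max) duration of the whole sequence in seconds.
    occurrences: list of (event, start_sec, end_sec) in temporal order.
    """
    if not rule or len(rule) != len(occurrences):
        return False
    for i, (step, (event, start, end)) in enumerate(zip(rule, occurrences)):
        if event != step.event:
            return False
        if not step.duration[0] <= end - start <= step.duration[1]:
            return False
        if i > 0:
            pause = start - occurrences[i - 1][2]
            if not step.gap[0] <= pause <= step.gap[1]:
                return False
    length = occurrences[-1][2] - occurrences[0][1]
    return total[0] <= length <= total[1]


SEQUENCE1 = [
    Step("Event1", (13, 18)),
    Step("Event2", (20, 39)),
    Step("Event3", (6, 12), gap=(0.5, 5)),
]
SEQUENCE1_TOTAL = (40, 64)
```

A detected triple such as Event1 from 0 to 15 sec, Event2 from 15 to 40 sec and Event3 from 42 to 50 sec satisfies all conditions of the rule.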

Related approaches that have been used for the generation of grammars
from time series usually consider just one time series, for example ECGs |2|,
carotid pulse waves ^ and eye movements. Furthermore, the main structures
of the time series are usually known a priori, for example PQ segments or
QRS complexes of an ECG signal. This means that no exploratory methods are
needed for the detection of elementary structures in the time series. In
contrast, we not only generate grammatical rules from multivariate time series,
but also use exploratory methods like unsupervised neural networks for the
detection of the main structures in the time series, i.e. primitive patterns.

2.6 Temporal Patterns 

Finally, similar sequences are joined together into a temporal pattern. To this
end, similarities between the events occurring in the sequences as well as the
durations of the events have been considered. As the example of the patient with
SRBD contains just one sequence, the temporal pattern also has just one sequence.
Otherwise, the temporal pattern would be described by an alternation of sequences
using an "or".

3 Conclusion 

Recently, different kinds of hybrid systems that integrate AI technologies and
neural networks have been developed P|. We emphasize that mainly "cooperative"
hybrid systems have been developed, i.e. systems in which several modules
implemented in different technologies cooperate. The main difference to our
approach is that in cooperative hybrid systems no transition between different
knowledge representation forms takes place m-. The hybrid system WINA m
is an example of a hybrid system in which a knowledge conversion for
high-dimensional data can be performed. This work was the starting point for the
recently developed method for temporal knowledge conversion (TCon) 0.

The main aim of the present paper was to give a brief description of the
different abstraction levels introduced by the method TCon. This approach enables
a successive and even smooth conversion of complex temporal patterns in
multivariate time series into a linguistic, temporal symbolic knowledge
representation, understandable for human beings, in the form of temporal
grammatical rules. In order to detect elementary structures in the time series,
self-organized neural networks, as proposed by Kohonen [?], together with special
visualization techniques, called U-Matrices [?], have been used. The realization
of the tasks at each level as well as the generation of the temporal grammatical
rules was illustrated through an example from medicine, namely sleep-related
breathing disorders (SRBD) [12]. SRBD are a very hard problem, since quite
different patterns may occur for the same temporal pattern, even for one patient.
Additionally, the duration of each temporal pattern can vary considerably.

For lack of space we could only give an overview of the method and present
a small example of our experiments with SRBD. We used a much larger database
with the most significant, i.e. most frequently occurring, SRBD. For details
see [?]. Altogether, we detected all temporal patterns with our method TCon
and were able to give a description of the temporal patterns, by means of the
temporal grammatical rules, that is meaningful for an expert on SRBD.
Additionally, some kind of “new” knowledge for one temporal pattern, i.e. a not
yet well-described SRBD in medicine, has been found.

Abstract. We present an approach to the problem of detecting intrusions
in computer systems through the use of behavioral data produced by
users during their normal login sessions. In fact, attacks may be detected
by observing abnormal behavior, and the technique we use consists in
associating with each system user a classifier made with relational decision
trees that will label login sessions as “legal” or as “intrusions”.

We perform an experiment with 10 users, based on their normal work,
gathered over a period of three months. We obtain a correct user recognition
rate of 90%, using an independent test set. The test set consists of new,
previously unseen sessions for the users considered during training, as
well as sessions from users not available during the training phase. The
obtained performance is comparable with previous studies, but (1) we
do not use information that may affect user privacy and (2) we do not
bother the users with questions.



1 Introduction 

User behavior is probably the rawest form of data available to be processed and 
exploited. In the case of people using a computer system, monitoring and data 
collection can be made by specific programs, and we are then left to use the 
gathered information in a suitable way. “Behavioral Data” can of course be used 
to classify users and to distinguish them from each other and from unknowns, 
and the most obvious application of such a form of classification is in the field
of Computer Security. It is well known that access controls (such as through
the use of passwords) are not sufficient by themselves to avoid intrusions, and 
the connection of computers to local networks and to the Internet is making 
intrusions not only possible, but more and more likely. In such a situation, we 
need a way to recognize a user as the legal owner of an account he/she is using, 
or as an intruder. Ideally, we should be able to do so as soon as possible, with a 
high level of accuracy, and possibly without affecting the privacy of the user. 

In this paper we present an approach to the problem based on the use of data
collected through the monitoring of users, and processed via Relational Decision
Trees [?]. We obtain a performance comparable with previous studies, while
avoiding many of their drawbacks, such as the use of structured typing text¹ and
the use of private information². The classification task is performed only ten
minutes after the beginning of the login session; the approach is very efficient
and scales well with the number of users to be monitored. Finally, it is easily
kept up to date as users change their behavior and acquire new skills.

¹ That is, predefined text that each user involved in the experiment is required to type
in order to “reveal” his/her own keystroke dynamics.

Intuitively, an intrusion is a successful attempt to use system resources with- 
out proper authorization. As a consequence, an intrusion detection system can 
be seen as a classifier: it classifies a particular computer or account state as 
either safe or unsafe³.

To classify intrusions, one may either write a classifier manually, based on 
expert knowledge, or obtain the classifier from examples of user-system interac- 
tion. In the first case, the model must be updated manually when new users are 
authorized to enter the system, or when new attack paradigms become known. It 
should be clear that manual approaches are only suitable for very particular sit- 
uations, especially where a limited number of legal users with very slow-changing 
habits are involved. 

Inductively acquired classifiers, by contrast, could perform well also in large 
and dynamic environments. In fact, it is easy to obtain a large number of ex- 
amples of “normal” system and user behavior. From such a large number of 
examples it is possible to obtain classifiers that perform well on new, previously 
unseen cases. Specific intrusion detection methods have been proposed that use
neural networks [?], genetic algorithms [?], automata [?] and general statistical
approaches [?]. The method presented in this paper is also of this kind, and
uses heterogeneous data taken from normal user sessions in a real local network 
environment. The method is based on a relational decision tree machine learning 
system. 

Within the class of intrusion detection systems using classifiers obtained au- 
tomatically from examples, we consider user classification. The reason for this 
choice is the difficulty of obtaining negative examples of what we have called 
“normal” behavior. The examples of abnormal system behavior or of unautho- 
rized user operations are, and should be, rare. As a consequence, they cannot be 
used effectively for the purposes of automated induction or statistical analysis. 

² Such as the knowledge of which files were edited and which words were typed.

³ More precisely, there are essentially two ways to realize that an intrusion is under way
or has occurred recently: (1) some users or some known user processes behave in a
way that is clearly unusual, e.g. a secretary starts running awk and gcc; (2) a typical
attack pattern is recognized, e.g. some user reads a password file or attempts to
delete system logs. In the first case we speak of anomaly detection, while the second
objective is defined as misuse detection. Both approaches have been investigated
in the literature. Some recent anomaly detection systems may be found in [?],
and misuse detection is discussed in [?]. Some systems combine the two
techniques to achieve higher performance (e.g., [?]). However, it should be clear
that misuse often implies some form of anomaly, unless a user is accustomed, e.g.,
to reading password files and deleting log files, as could be the case for system
administrators. As a consequence, in many cases anomaly detection also includes
some form of misuse detection.
We therefore choose to distinguish one legitimate user from other known legitimate
users and unknown users, instead of recognizing a legitimate user as opposed
to an attacker. For the aim of the experiment, the unknown users are users who
behave “normally” w.r.t. their own account. Therefore, they would show potential
anomalies when behaving as usual in someone else’s account, and hence can well 
represent intruders. As a consequence, we easily have available a large number 
of positive and negative examples of user behavior: for each legitimate system 
user, logs recorded during an interactive session represent the positive exam- 
ples, while the corresponding information for all the other users represents the 
negative examples. 

Techniques are available for automatically generating accurate classifiers from 
positive and negative examples, as developed in Machine Learning, Neural Net- 
work, and general Pattern Recognition research. Good classifiers will label the 
available examples as either positive or negative with a low number of errors. 
However, good classifiers should also make a limited number of errors on new 
examples, i.e. user information obtained in future interactive sessions. And, an 
even more demanding requirement, good classifiers should perform well even 
when new users are introduced into the system, users that were not used during 
the training phase. This is important, as the attacker may not be among the 
current authorized system users. 



2 Inductive Learning of a User Classifier 

For obtaining a user classifier from examples of “normal” user sessions, we use 
a method for the automated induction of decision trees. Together with neural 
networks and methods derived from genetic algorithms, decision trees are among 
the most accurate general purpose classifiers that can be obtained inductively
[?]. They are also limited in size and very efficient to learn and use,
compared to other powerful learning techniques (such as, e.g., Horn rule
learners [?]). However, they have not been used much in intrusion detection.
The only study we know of in this area is [7], where decision trees are used for
classifying connection types (e.g. SMTP vs telnet) from network traffic data. 
For our problem, a decision tree may be seen as a procedure that classifies 
patterns representing user’s behavior and operations, as either related to one 
specific user (positive), or to another user (negative). Patterns will be described 
by means of so-called attributes, i.e. functions that take a finite number of values. 
For example, an attribute could have as its value the average typing speed of 
the user, and another attribute could have as its value a symbol indicating the 
specific command the user typed first during one login session. A node of a 
decision tree will correspond to some attribute, and the arcs from this node 
to its children correspond to particular values of the attribute. Leaves will be 
associated to a classification, positive or negative. A decision tree can then be 
used for classifying a user login session s as follows: we start from the root of the 
tree, and evaluate the corresponding attribute for s obtaining a value v; then we 
follow the arc labeled by v and reach the corresponding node; then, we repeat 
the procedure until a leaf is reached, and output the associated classification, i.e.
the decision tree finally classifies the session as either belonging to the chosen 
user (a positive classification) or not (a negative classification). We can obtain 
one decision tree for each user. 
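
The classification procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Node and Leaf types, the two attribute functions and the toy tree are hypothetical and are not taken from the learned trees discussed later.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    positive: bool  # True: the session belongs to the chosen user


@dataclass
class Node:
    # An attribute maps a session to one of a finite number of values.
    attribute: Callable[[dict], str]
    # One arc (child) per attribute value.
    children: Dict[str, "Tree"] = field(default_factory=dict)


Tree = Union[Leaf, Node]


def classify(tree: Tree, session: dict) -> bool:
    """Follow the arcs selected by the attribute values until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        value = node.attribute(session)
        node = node.children[value]
    return node.positive


# Hypothetical attributes, for illustration only.
def first_command(session: dict) -> str:
    cmd = session["first_command"]
    return cmd if cmd in ("elm", "vi") else "other"


def typing_speed(session: dict) -> str:
    return "fast" if session["avg_ms"] < 250 else "slow"


# Hypothetical tree for one user: starts with elm and types fast.
toy_tree = Node(first_command, {
    "elm": Node(typing_speed, {"fast": Leaf(True), "slow": Leaf(False)}),
    "vi": Leaf(False),
    "other": Leaf(False),
})
```

Calling classify(toy_tree, {"first_command": "elm", "avg_ms": 243}) follows the arc "elm" and then the arc "fast", and ends in a positive leaf.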

For our application we have used ReliC [?], a system that learns relational
decision trees from both positive and negative examples. Relational decision trees
(RDTs) differ from traditional (propositional) decision trees in their ability
to deal with relations, and not only with attributes. Under this definition, we
can say that RDTs are a generalization of decision trees, because the latter are
able to work only with unary relations (attribute-value representations). (In [?],
the even more general notion of Logical Decision Tree is defined, and it is shown
how to map a logical decision tree to a logic program and vice-versa.)

ReliC is based on the older, but excellent work represented by ID3 [?] and
C4.5 [?]. C4.5 has become a standard reference in decision tree learning: it
has sophisticated pruning mechanisms, and performs well on most applications.
Moreover, it is readily available and the implementation is robust. ReliC differs
from C4.5 in offering the following advanced options:

— n-ary relations with n ≥ 1 can be used directly by the system;

— the basic C4.5 post-pruning algorithm can be replaced with a stopping
criterion which limits the expansion of the tree;

— a database interface is provided with the system in order to perform queries
for data stored off-line in a DBMS, so as to deal with a very large number
of structural examples, as is required in intrusion detection;

— an initial weight can be assigned to each class of examples during the learning
phase; this is useful in contexts where some examples are considered more
important. In our experiments, positive examples were given higher weights,
because they were less numerous.

For the experimental data that we have used, the relational characteristics of 
the system may have improved the discrimination performances w.r.t. the basic 
C4.5 system. The experimental setting is discussed in the next two sections. 

3 Data Acquisition 

The data used in our experiments were collected over a period of three months.
Ten volunteers in our department agreed to be monitored as described below.
The volunteers were asked to behave as usual, with the only constraint of not
allowing other people to sit at the keyboard and use the workstation with the
volunteer’s account. The monitored people included two system administrators,
one PhD student, one professor and six researchers. Each user is monitored by
two programs that are launched when the user logs on and runs the X server.
After ten minutes, the programs save the relevant data and stop⁴.

⁴ The proposed method, however, would also be applicable with different choices: 1) any
windows-like platform that allows keystrokes to be captured may be used as a
The first program, time, connects to the X server and gets the elapsed time
between every two keystrokes on the keyboard. To guarantee privacy, typed 
characters are blurred and are not recorded. Since we want the average elapsed 
time in continuous typing, time does not take into consideration times larger 
than 600 milliseconds. For such cases we assume the user stopped typing (she 
is reading her mail or has gone to take a coffee). Elapsed times are summed 
up together, and after ten minutes this sum is divided by the number of typed 
keys. Average elapsed time between two strokes and the number of strokes are 
recorded and then the time process dies. 
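
The computation performed by time can be sketched as follows. This is a minimal illustration and not the original program: it assumes that the keystroke times are already available as a list of millisecond timestamps (the function name and this input format are hypothetical), whereas the real program obtains the events from the X server.

```python
from typing import List, Tuple

MAX_GAP_MS = 600  # longer pauses are treated as "the user stopped typing"


def keystroke_stats(timestamps_ms: List[int]) -> Tuple[int, float]:
    """Return (number of keystrokes, average elapsed time in milliseconds)."""
    total = 0
    for prev, cur in zip(timestamps_ms, timestamps_ms[1:]):
        gap = cur - prev
        if gap <= MAX_GAP_MS:  # pauses above the threshold are not summed
            total += gap
    n_keys = len(timestamps_ms)
    # As in the description above, the sum is divided by the number of typed keys.
    average = total / n_keys if n_keys else 0.0
    return n_keys, average
```

With the timestamps 0, 200, 500, 2000 and 2300 ms, the 1500 ms pause is ignored, the remaining gaps sum to 800 ms, and the average over the five keys is 160 ms.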

The second program, command, records the commands executed by the user 
in the first ten minutes of his working session, together with the number of times 
each command was run. This is done through the lastcomm command provided 
by the Unix System. Lastcomm gives information on previously executed com- 
mands on the system, but at the same time provides a reasonable level of privacy, 
since it does not report the arguments of the commands (for example, through 
lastcomm we may see that a user used ‘vi’, but cannot know which file was 
edited). It must be observed that lastcomm, if active, can be used by every user 
of a system. Hence, command does not use more information than that normally 
available to every non-root user of a Unix system⁵.

Together with the above information, the login time is also recorded. Hence, 
after the first ten minutes of a session, each user produced a set of parameters 
such as the following: 

user: User-1;
login time: 09:15;
number of keystrokes: 157;
average elapsed time between keystrokes: 243 (milliseconds)
command: cat, how-many: 3;
command: elm, how-many: 1;
command: more, how-many: 3;
command: rm, how-many: 2;

Every such set is a positive example of User-1, and a negative example for every 
other user. 

platform; 2) instead of stopping after an elapsed time of 10 minutes, the programs
could stop after a certain number of commands have been typed, or after a certain
number of typed user actions have been performed; 3) mouse clicks labeled with
corresponding actions could be used instead of, or in addition to, simple Unix
commands. Our experiments show that users may be characterized and classified on
the basis of their interaction with the system, but the kind of interaction that is
monitored may be tailored to the installation environment.

⁵ For the sake of truth, it must also be said that lastcomm is also one of the Unix
commands most hated by system administrators, since it tends to produce a very
large accounting file in a short time.

Every example must be turned into something that the learning system can
handle. The first three parameters can immediately be used as continuous
attributes⁶ by the learning system. To handle the commands we must split them
into a set of classes. There are almost five hundred commands in the SunOS
release of the Unix system, and it is not practical to make a class for each
command. Hence, the commands are grouped into a set of classes, where each class
contains ‘homogeneous’ commands. For example, one class contains the commands
used to view the contents of files. Commands such as cat and more belong to this
class. In the above instance, User-1 used commands in this class 6 times. Commands
used to modify files as a whole (such as cp and mv) form another class. In the
given example, User-1 used one such command, rm, twice. We initially identified
24 classes of Unix commands. Later, they were increased to 37.
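
As a rough illustration of this encoding step, the sketch below turns the example session of User-1 into attributes. The class names and the dictionary layout are hypothetical; only the grouping of cat and more into one class and of cp, mv and rm into another, and the conversion of the login time into minutes from midnight, follow the text.

```python
from collections import Counter
from typing import Dict

# Hypothetical fragment of the command partition (class names are invented).
COMMAND_CLASS: Dict[str, str] = {
    "cat": "view_file", "more": "view_file",
    "cp": "modify_file", "mv": "modify_file", "rm": "modify_file",
    "elm": "mail",
}


def login_minutes(hhmm: str) -> int:
    """Login time as the number of minutes from midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def class_counts(commands: Dict[str, int]) -> Counter:
    """Sum the per-command counts over the class each command belongs to."""
    counts: Counter = Counter()
    for command, how_many in commands.items():
        counts[COMMAND_CLASS.get(command, "other")] += how_many
    return counts


# The example session of User-1 shown above.
session = {
    "login_time": "09:15",
    "keystrokes": 157,
    "avg_ms": 243,
    "commands": {"cat": 3, "elm": 1, "more": 3, "rm": 2},
}
counts = class_counts(session["commands"])
```

For this session the file-viewing class is used 6 times and the file-modifying class twice, in agreement with the figures given above.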



4 The Experiments 

Given a set of positive and negative examples of a user, our goal is to synthesize 
a decision tree representing a model of that user. When given in input a new 
example, the decision tree must be able to correctly classify it as a positive or 
negative example of the user. 

It is very important to note that we totally ignored some of the available 
users during the training phase - these users were only used for testing the sys- 
tem performance, and are equivalent to external, previously unknown intruders. 
More precisely, six users were selected to learn a model, let us call these the 
known users. The remaining four were left out to be used only in the testing 
phase, as explained below. We will call them the unknown users. The set of 
positive examples of each known user was randomly split into a training set
containing 2/3 of the examples, and a test set containing the remaining examples⁷.
A decision tree for a known user was learned from a set of positive and negative
examples of that user. The training set of this user’s examples was used as
positive examples, and the training sets of the other 5 known users were used as
negative examples⁸. The learned decision tree was then tested on the set of the
testing examples, in order to compute the percentage of positive examples of the 
user classified as positive, and the percentage of negative examples classified as 
negative. 

⁶ Actually, the login time is first turned into the number of minutes from midnight. In
the given example it becomes the number 555.

⁷ For each user there was a total number of positive examples varying from 15 to 90:
the number of times he logged in at his workstation under X during the three months
of monitoring. The total number of examples from all the users is 343.

⁸ Recall that a total of 10 users was available: 6 were called “known” and were used
for training, the others were called “unknown” and used only for testing, making
them equivalent to external intruders.

The testing set for each known user was made by putting together: a) the
examples of that user not used in the learning phase (these examples were marked
as ‘positive’); b) the examples of the other five known users not used in the
learning phase (marked as ‘negative’); c) all the examples of the four unknown
users, marked as ‘negative’. The presence, in the testing sets, of negative
examples of unknown users is important, because it simulates the real situation in
which an intruder comes from the outside, and hence his behavior is completely
unknown to the ‘guards’. For this reason we selected the four unknown users to
be as heterogeneous as possible. The four unknown users are the professor, the PhD
student, one of the researchers and one of the system administrators. As is
common in Machine Learning, the learning/testing process just described was
repeated 6 times for each user, each time with a different random split of the set
of his examples into a training and a testing set. We then computed the mean
error rates over the six runs available for every known user.
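
The construction of the training and testing sets can be summarized by the following sketch. It is an illustration under assumptions, not the code used in the experiments: the function names, the representation of examples and the use of Python's random module are hypothetical, and the split is applied to every known user in each run.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, int]            # placeholder for one encoded session
Labeled = Tuple[Example, bool]       # (example, is it positive for the target?)


def split_examples(examples: List[Example], rng: random.Random):
    """Random split: 2/3 of the examples for training, the rest for testing."""
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


def build_sets(target: str,
               known: Dict[str, List[Example]],
               unknown: Dict[str, List[Example]],
               rng: random.Random) -> Tuple[List[Labeled], List[Labeled]]:
    """Training and testing sets for the classifier of one known user."""
    splits = {user: split_examples(ex, rng) for user, ex in known.items()}
    # Training: the target's training part is positive, the training parts
    # of the other known users are negative.
    train = [(x, user == target) for user, (tr, _) in splits.items() for x in tr]
    # Testing: held-out examples of the known users ...
    test = [(x, user == target) for user, (_, te) in splits.items() for x in te]
    # ... plus all examples of the unknown users, always negative.
    test += [(x, False) for ex in unknown.values() for x in ex]
    return train, test
```

Repeating build_sets six times with different random seeds for each known user, and averaging the error rates, corresponds to the procedure described above.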

Moreover, the whole procedure was repeated in six different experiments.
The experiments differ in the attributes used to describe the users. The
outcomes for these experiments are reported in Table 1 and are discussed in the
next section. Each entry of the table reports the mean percentage of positive and
negative examples that are classified correctly. As an example, Table 2 reports
the outcomes for the six known users in the last experiment (Exp-6) of Table 1.



Table 1. Experimental results. Positive (negative) accuracy is the percentage 
of positive (negative) examples in the test set that are classified correctly. Total 
Errors is the percentage of positive and negative examples that are not classified 
correctly. (The total error rate is not the mean of positive and negative error 
rates, as there are more negative examples.) 



Exp.     Pos. Accuracy   Neg. Accuracy   Total Errors
Exp-1    73.3%           89.3%           11.8%
Exp-2    80.5%           90.3%           10.2%
Exp-3    82.8%           89.7%           10.4%
Exp-4    81.6%           90.6%           10.1%
Exp-5    84.1%           90.3%           10.2%
Exp-6    85.4%           89.3%           10.9%



Table 2. Exp-6 results 



User           Pos. Accuracy   Neg. Accuracy   Total Errors
researcher 1   79.0%           95.0%            6.2%
researcher 2   88.3%           86.3%           13.3%
researcher 3   94.5%           85.3%           14.1%
researcher 4   81.7%           90.3%           10.2%
researcher 5   75.2%           83.2%           17.6%
sys. admin.    93.8%           95.8%            4.2%
average        85.4%           89.3%           10.9%
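
The averages in Table 2 can be recomputed directly from the per-user rows. The short Python check below does so, and also computes the mean error rates over the five researchers alone; the variable names are chosen here for illustration only.

```python
from typing import Dict, List, Tuple

# Rows of Table 2: (positive accuracy, negative accuracy, total errors), in %.
TABLE_2: Dict[str, Tuple[float, float, float]] = {
    "researcher 1": (79.0, 95.0, 6.2),
    "researcher 2": (88.3, 86.3, 13.3),
    "researcher 3": (94.5, 85.3, 14.1),
    "researcher 4": (81.7, 90.3, 10.2),
    "researcher 5": (75.2, 83.2, 17.6),
    "sys. admin.": (93.8, 95.8, 4.2),
}


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


rows = list(TABLE_2.values())
avg_pos = mean([pos for pos, _, _ in rows])
avg_neg = mean([neg for _, neg, _ in rows])
avg_err = mean([err for _, _, err in rows])

# Mean error rates over the five researchers only.
researchers = [v for name, v in TABLE_2.items() if name.startswith("researcher")]
res_pos_err = 100.0 - mean([pos for pos, _, _ in researchers])
res_neg_err = 100.0 - mean([neg for _, neg, _ in researchers])
```

Rounded to one decimal, the three averages reproduce the last row of the table, and the researchers alone show mean error rates of about 16.3% on the positive examples and 12.0% on the negative examples.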
The attributes used in the six experiments to describe every example of every
user are as follows. 

— Exp-1: login time, average elapsed time between keystrokes, number of key- 
strokes, 24 attributes representing classes of Unix commands (these are bi- 
nary attributes: the value is 1 if the user used one command of the corre- 
sponding class at least once. It is 0 otherwise). 

— Exp-2: as in Exp-1, but with 37 classes of Unix commands. 

— Exp-3: as in Exp-2, but the average elapsed time between keystrokes is taken 
into consideration only if the number of keystrokes is larger than 100. 

— Exp-4: as in Exp-2, but the Unix commands are counted, so that each at- 
tribute indicates the number of Unix commands of the corresponding class 
run by the user. 

— Exp-5: as in Exp-4, but the average elapsed time between keystrokes is taken 
into consideration only if the number of keystrokes is larger than 100. 

— Exp-6: as in Exp-5, but the login time is not taken into consideration. 
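
The six settings differ only in a few switches, which the following sketch makes explicit. It is a hedged illustration: the ExpConfig type, the dictionary layout of a session and the choice of encoding an ignored typing speed as a missing value (None) are assumptions, and the per-class counts are assumed to have been computed already for the partition in use.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ExpConfig:
    n_classes: int                  # 24 or 37 classes of Unix commands
    counted: bool                   # False: binary attributes, True: counts
    min_keystrokes: Optional[int]   # typing speed used only above this number
    use_login_time: bool


EXPERIMENTS: Dict[str, ExpConfig] = {
    "Exp-1": ExpConfig(24, False, None, True),
    "Exp-2": ExpConfig(37, False, None, True),
    "Exp-3": ExpConfig(37, False, 100, True),
    "Exp-4": ExpConfig(37, True, None, True),
    "Exp-5": ExpConfig(37, True, 100, True),
    "Exp-6": ExpConfig(37, True, 100, False),
}


def encode(session: dict, cfg: ExpConfig) -> dict:
    """Attribute-value description of one session under a given setting."""
    features: dict = {}
    if cfg.use_login_time:
        features["login_minutes"] = session["login_minutes"]
    features["keystrokes"] = session["keystrokes"]
    enough = cfg.min_keystrokes is None or session["keystrokes"] > cfg.min_keystrokes
    # Assumption: an ignored typing speed is encoded as a missing value.
    features["avg_ms"] = session["avg_ms"] if enough else None
    for c in range(cfg.n_classes):
        n = session["class_counts"].get(c, 0)
        features[f"class_{c}"] = n if cfg.counted else int(n > 0)
    return features
```

For a session with 80 keystrokes, for example, the Exp-2 encoding keeps the typing speed and marks a used class with 1, while the Exp-6 encoding drops the login time, reports the class counts and leaves the typing speed undefined.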

The classification rules synthesized by the learning procedure are normally mean- 
ingful and easy to understand from a “human” point of view. As an exam- 
ple, consider the decision tree learned in the first run of experiment Exp-6 for 
researcher-4. This tree can be translated into a set of nine clauses. The first two 
of these clauses correspond to the following rules: 

If the monitored user runs a command in class 35, he is not researcher-4
If the monitored user runs a command in class 30, he is researcher-4

In fact, class 35 contains commands such as accton, lpc and lpstat, which are
typical commands for system administration, and hardly used by normal users.
On the other hand, researcher-4 is, in real life, an experienced C programmer,
and class 30 contains commands such as make, lint and ctrace⁹.

5 Discussion of the Results 

By looking at Table 1 we immediately notice an improvement in the ability to
classify the positive examples from the first to the second experiment. This is
due to a better classification of the Unix commands into a set of classes. The
first experiment used 24 classes of commands, which were increased to 37
in the second experiment (and kept in the other experiments).

⁹ Actually, because of the second rule listed, we must gather that researcher-4 was the
only one using commands in class 30, at least during the monitoring period.

To partition the commands into classes, we just looked at the whole set of
Unix commands and empirically split them into homogeneous sets. For example,
one set contains basic commands used to move around in the file system, such as
cd and pwd. Another set contains the printing commands, such as lpr and lpq;
and another includes commands used to change files and directories, such as cp,
rm, mkdir and rmdir. After the first experiment, we looked at the 24 classes,
and further split some of them, obtaining 37, because some of the initial classes
were clearly ‘oversized’ and meaningless. As an example, there was a class
containing the commands used to handle mail, consisting of elm, mail and mailtool.
Though reasonable, this class was probably wrong. A user tends to always use the
same command to read and answer mail, and to avoid the others, especially because
these commands have different interfaces, and offer a lot of options that require
time to be learned and used. Hence, we made three different classes, one for each
of the three mail commands. The adopted partitioning seems reasonable, but is
certainly also questionable. For example, we have a class of basic security
commands containing passwd, su, and login. Whereas the first command can be used
by any user to change her own password, the use of the other two may suggest that
the user knows someone else’s password. Hence, it could also be reasonable to put
passwd and the other two commands in different classes¹⁰.

The third experiment takes into consideration the elapsed time between two 
keystrokes only if the number of keystrokes is larger than 100. The obvious 
rationale is that when only few characters are typed the corresponding typing 
speed is not really meaningful and can be misleading. 

In the fourth experiment we tried to take into account the number of times 
commands of each class were used, and not only whether they were used or not. 
By itself, this information does not seem to be meaningful, and actually results
in a slight decrease of the predictive power on the positive examples. However,
combining this information with the threshold on the number of keystrokes (fifth
experiment) results in an improvement of the outcomes w.r.t. the previous
experiments. The reason probably lies in the fact that, in ten minutes, a user
does not normally run a large number of commands. Hence, knowing how many 
times a particular command was used, does not bring more information than 
knowing whether that command was used or not at all. But when the number of 
keystrokes is (relatively) large, then presumably also the number of commands 
increases, and becomes relevant and useful. 

Obviously, a 90% user recognition rate is still inadequate for a fielded
intrusion detection system¹¹. However, it must be observed that our homogeneous
test environment did not help in the classification process. Most of the users
are academic people with essentially similar habits, using the same hardware
and software platform. It is likely that classification of external users would
yield significantly lower error rates. Consider Table 2. The only system
administrator is recognized with an error rate of about 5%. On the other hand, the
five researchers are recognized with average error rates of 16.3% and 12% on the
positive and negative examples, respectively. If we assume that these people have
similar habits, we may reasonably expect larger errors when trying to distinguish
them from one another¹².

¹⁰ Clearly, this is a point where the knowledge of an expert, i.e. a system
administrator, would greatly help to define a meaningful set of command classes.
Moving from 24 to 37 classes of commands, we also noticed that for every user the
results of each run became more stable. That is, the predictive power of the
learned tree for a user was roughly always the same, regardless of the random set
of examples used to learn the tree.

¹¹ Also, we well understand that a number of ten users involved in the experiments
is rather limited. Actually, these users are all those who agreed to be monitored,
whereas others refused because they were afraid of possible infringements of their
privacy. In general, we think this is a very important point that must be faced.
Users must understand and accept that every security policy must imply, in some
way, a limitation of their privacy.

Finally, one may observe that a decision tree used to model a user may sooner 
or later become out of date, as users’ habits tend to change (though slowly) over 
time. New skills are acquired, new programs and commands are used, others are 
abandoned. This is in itself not a real problem. Suppose a decision tree for a
user has been built using, say, a set made of the last 30 logins of that
user (plus a set of negative examples automatically provided by the monitoring
of other users). When the user logs in and, after ten minutes, is recognized as the
legal user, the new example replaces the oldest one in the set of positive examples
of that user, and a new decision tree is synthesized. This task requires just a few
seconds (whereas using the decision tree takes virtually no time), and
the older model can be replaced. For the same reason, the method is scalable to
larger environments, as one decision tree must be generated for each user, and 
complexity grows linearly with the number of legal users. 
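
The update scheme described in this paragraph amounts to keeping a fixed-size window of recent positive examples for each user. A minimal sketch, assuming a hypothetical learn function that stands in for the decision tree learner (it is not the interface of ReliC), could look as follows.

```python
from collections import deque
from typing import Any, Callable, Deque, List, Optional

# Hypothetical learner: takes positive and negative examples, returns a tree.
Learner = Callable[[List[Any], List[Any]], Any]


class UserModel:
    """Keeps the last `window` accepted sessions of one user and its tree."""

    def __init__(self, learn: Learner, window: int = 30) -> None:
        self.learn = learn
        self.positives: Deque[Any] = deque(maxlen=window)
        self.tree: Optional[Any] = None

    def accept(self, session: Any, negatives: List[Any]) -> None:
        """Call once a session has been recognized as the legal user's."""
        # With maxlen set, appending to a full deque drops the oldest example.
        self.positives.append(session)
        # A new tree is synthesized and replaces the older model.
        self.tree = self.learn(list(self.positives), negatives)
```

One such object per legal user is sufficient, which is why the cost of keeping the models up to date grows linearly with the number of users.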



6 Comparisons and Conclusions 

¹² In the learning phase of the classifier of a researcher, about 84% of the negative
examples used belong to other researchers. In the testing phase, only about 60%
of the negative examples are from other researchers. This happens because, in the
testing phase, unknown users’ examples are included as negatives, and only one of
these unknown users is a researcher. This also explains why the negative accuracy
is better than the positive accuracy.

The method presented and the above experiments show that we can effectively
distinguish one user from other known and unknown users, based on general
characteristics such as typing speed and command history. We observed an error of
roughly 10 percent on an independent test set in almost all of our experiments.
Higher recognition rates can be obtained if the precise latencies and durations of
keystrokes can be measured¹³. Legget et al. [?] report a 5.25 percent error for 36
users, who were required to type the same text, consisting of 537 characters. Brown
and Rogers [?] used a neural network approach to obtain one-sided errors between
12 and 21 percent. In this case, users were asked to type names of only
15 characters, in order to create the training data. Furnell et al. [?] share some
of the objectives of our study, and obtain an 85% impostor detection rate, using
only keystroke analysis data. Monrose and Rubin [?] obtain a 90% rate
of correct user recognition, but their experimental setting is different because
given, “structured” typing text is required, and keystroke duration is used¹⁴.
However, keystroke duration can only be measured on a local keyboard, as key
release interrupts are not available for a network connection. But even latency is
not easily used for network connections: even small communication delays make
time intervals between individual characters unreliable. Average typing speed
is less affected by network delays, although it may be compromised by severe
bandwidth restrictions.

¹³ Keystroke latency is the elapsed time between every pair of specific typed keys.
Keystroke duration is the time a key is held pressed down during typing.

There are of course many ways to improve the results. First, it is possible to
update and modify a decision tree 'by hand', because decision trees, as opposed
to neural networks and standard genetic classifiers, are easy to understand and
edit. A very precise model of a user can be built in that way, but extending such
a procedure to a large set of users would be exhausting and time consuming.
Second, improvements are possible along the line adopted in our experiments.
We observed earlier that the login time did not help to classify the users
monitored in our experiments. However, a real intruder would be inclined to
masquerade under some account when the legal owner is not connected, so as to
avoid manual detection. The intruder would then be forced to connect rarely or
to log in at times that are unusual for the legal account owner. Login time would
then be a useful attribute in fielded system installations. Also, an intruder would
probably show a high level of activity from the very beginning of the connection,
and this would lead to other useful decision tree attributes. Other parameters
could be useful and, in particular, there is at least one piece of information that
would greatly improve performance: the argument(s) of commands. The files used are
particularly meaningful in this sense. We have not used file attributes in our
experiments so as to protect user privacy. However, it is one thing to know that
a user is just running an editor; it is another to know whether he is
editing one of his own files or, say, /etc/passwd.

Attacks are often successful just because no monitoring procedure has been 
activated, and because different intruding techniques are used. Therefore, it is 
important to study different forms of intrusion detection that can also be
combined to achieve better performance. In this paper we have shown
that heterogeneous data produced by normal user behavior can be used to detect 
anomalies and intruders, and can hence be useful to improve the safety of our 
systems. 

Acknowledgements: This research was partially supported by Esprit Project
20237 ILP2: Inductive Logic Programming II. We want to thank all the people
Notice that Furnell's approach infringes users' privacy to some extent, since the
characters typed by each user must be recorded in order to recognize his/her
keystroke dynamics. The other methods bother the users by asking them to type a
predefined text. In contrast, our approach is essentially transparent to the users.
Abstract. One of the most common problems encountered in agriculture
is that of predicting a response variable from covariates of interest.
The aim of this paper is to use a Bayesian neural network approach to
predict dairy daughter milk production from dairy dam, sire, herd and
environmental factors. The results of the Bayesian neural network are
compared with the results obtained when the regression relationship is
described using the traditional neural network approach. In addition,
the "baseline" results of a multiple linear regression employing both
frequentist and Bayesian methods are presented. The potential advantages
of the Bayesian neural network approach over the traditional neural
network approach are discussed.



1 Introduction 

Many different genetic and environmental factors affect the profitability of dairy
herds in Australia. Production traits of the individual animals (e.g. milk, fat
and protein yields), other animal traits such as "workability" traits (e.g.
temperament) and type traits (e.g. size), environmental influences such as climate,
season, feed availability, and management practices, all contribute to the
profitability of the herd. Complex relationships exist between fertility, milk yield,
lactation length and culling which are difficult to model using conventional
techniques [22]. Australian dairy farmers face a difficult task in attempting to
combine all available information in a mating strategy that will increase
profitability in their herds.

Gianola and Fernando describe the objective of a breeding program to
be the elicitation, by selection, of favourable trends in a "merit function". The
larger issue of selection can be broken down into three sub-problems, each of 
which must be addressed by the animal breeder for such favourable trends to 
occur. These sub-problems can be described as: 

a) The definition of the breeding goal (or determining worthwhile genetic chang- 
es), which in the context of the dairy industry can be expressed in terms of 
the value of milk, fat and protein yields of dairy cows. 



D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA'99, LNCS 1642, pp. 395 ff., 1999.
© Springer-Verlag Berlin Heidelberg 1999
b) The estimation of breeding values (or correctly and efficiently identifying
genetically superior animals), currently provided by the calculation of breeding
and production values, using performance and pedigree information. These 
breeding values can then be used in the calculation of a selection index for 
use in sire ranking or dam appraisal. 

c) Mate selection and mate allocation (identifying the most genetically and 
economically efficient process of matching the selected animals) has always 
proved the most difficult part of the animal breeding equation to solve, due to 
the dynamics, and in particular, the non-linearity of the problem. There are 
two sub-problems to be addressed here. The first is the prediction problem, 
where the traits of the progeny are predicted from the traits of the sire, dam 
and the environment. The solution to this problem rests on the solutions to 
(a) and (b) above. The second component is an optimisation problem, where 
a simultaneous solution is attempted for mate selection and mate allocation. 

This paper is a product of a collaborative project between the Queensland
University of Technology Machine Learning Research Centre and School of
Mathematical Sciences, and the Queensland Department of Primary Industries. The
aim of the project is the solution of (c) above, through the development of a
PC based stand-alone program which can predict the optimal mating strategy
for mating dairy sires with dams to maximise a "merit function". This paper
investigates a role for machine learning, in particular the contribution of
neural networks (NNs) and Bayesian statistics, in the solution of the prediction
problem. The output of the adopted prediction model will be used as input to
the optimisation model in an intelligent decision support environment [2]. There
are five objectives in this optimisation model: maximisation of the profit index,
the selection index and the profit obtained from culling, and minimisation of
inbreeding and semen cost. The prediction model is used to formulate the first
two objectives in the optimisation model.

The current approaches to prediction and previous applications of NNs and 
Bayesian statistics to the dairy prediction problem are discussed in the remainder 
of Section 1. Section 2 contains an outline of the methods used, Section 3 presents
the results of the study, and Section 4 gives a discussion of the results,
conclusions and future work.



1.1 Current Approaches to Prediction 

Since genetic improvement through selection depends on correctly identifying
individuals with the highest true breeding value, the accurate prediction of
breeding value constitutes an important component of any breeding programme.
Genetic parameters, such as heritabilities and genetic correlations, which are
necessary for the computation of breeding values, have been established for milk
yield, survival, workability and type traits for Australian Holstein-Friesian and
Jersey cattle. Estimated Breeding Values (EBVs) or Australian Breeding Values
(ABVs) are computed by the Australian Dairy Herd Improvement Scheme (ADHIS)
using the Best Linear Unbiased Predictor (BLUP). BLUP has become
the most widely accepted method for genetic evaluation of domestic livestock.
Using animal model BLUP, the genetic merit of an animal is predicted from
its own and its relatives' records compared with records of other animals after
adjusting for environmental and managerial factors. When selecting bulls and
cows for breeding, an optimum combination of all EBVs is sought. A standard
method for combining information on different traits for selection purposes is
the selection index (SI) [10].

The problem addressed in this paper is prediction of the performance of 
daughters from a particular mating. This requires not only information about 
past performance of the parents and their relatives (integrated using BLUP) but 
also additional herd and environmental information. 



1.2 The Application of Neural Networks to the Dairy Prediction 
Problem 

The problem of prediction of daughter milk production from sire and dam
production records appears to have many characteristics which might make an
NN solution more attractive than that obtained using other machine learning
paradigms. NNs have a tolerance to both noise and ambiguity in data. The
dairy database, like most agricultural data sets, is inherently noisy, and is a
collection of indicators that represent genetic and environmental influences
including climate and farm management. NNs can be applied to problems which
require consideration of more input variables than could be feasibly handled
by most other approaches, a potentially important issue with the present
dairy problem. NNs have the ability to approximate non-linear relationships
between sets of inputs and their corresponding sets of outputs. The dairy data
could be expected to display some degree of non-linearity. The ability of NNs to
generalise well and to learn concepts involving real-valued features are
potential advantages with this project, since the predicted daughter responses
are continuous variables. However, an attempt has been made to categorise this
data into discrete classes and analyse it using symbolic learning paradigms.

On the other hand, many machine learning researchers regard the "black-box"
nature of the learning carried out by NNs as a major disadvantage to their
use as a learning paradigm. Such researchers argue that the concepts learned
by NNs are difficult to understand as they are hidden in the architecture of the
network. Nevertheless, there has been some success in identifying the task learned
by the network by the extraction of symbolic rules. A second disadvantage
of NNs is the high cost of the learning process, which can require large and
general sets of training data which might not always be available and which
can be very costly in terms of the time needed for the learning to take place.
Thirdly, the adaptivity of the most commonly used NN, the multilayer
perceptron, does not extend to the architecture of the model chosen.

The potential advantages of a NN solution to the dairy prediction problem are 
explored in this paper, as well as an investigation of the comparative advantages 
of a Bayesian framework applied to the NN. 
1.3 The Application of Bayesian Statistics to Neural Networks

The essential characteristic of Bayesian methods is their explicit use of
probability for quantifying uncertainty in scientific analysis. Gelman et al. break
down the process of Bayesian data analysis into three steps:

1. Setting up a full probability model: a joint probability distribution for all 
observable and unobservable quantities in the problem. 

2. Conditioning on observed data: calculating and interpreting the appropriate 
posterior distribution given the data. 

3. Evaluating the model fit: assessing the implications of the posterior distri- 
bution. 

MacKay and Neal describe a Bayesian neural network (BNN), where a
probabilistic interpretation is applied to the NN technique. This interpretation
involves assigning a meaning to the functions and parameters already in use. In
the Bayesian approach to NN prediction, the objective is to use the training set of
inputs and targets to calculate the predictive distribution for the target values in
a new "test" case, given the inputs for that case. According to these authors, the
hybrid approach of a Bayesian framework applied to the NN overcomes many of
the disadvantages of NNs previously discussed. The Bayesian framework allows
the objective evaluation of a number of issues involved in complex modelling
including the choice between alternative network architectures (e.g. the number of
hidden units and the activation function), the stopping rules for network training
and the effective number of parameters used. MacKay postulates that the
overall effect of the Bayesian framework should be realised in the reduction in
the high cost of the learning process in terms of the time needed for the learning
to take place. The framework allows for the full use of the limited and often
expensive data set for training the network.

The Bayesian techniques used in this paper employ Markov Chain Monte
Carlo (MCMC) methods to simulate a random walk in $\theta$, the parameter space
of interest. The random walk converges to a stationary distribution that is the
joint posterior distribution, $p(\theta \mid y)$, where $y$ represents the observed
or target data.
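
To make the idea of a random walk that settles on the posterior more concrete, the following is a minimal random-walk Metropolis sketch for a single parameter. It is not the sampler used in the paper: the toy model (Gaussian data with unknown mean), the prior and noise standard deviations, the step size and the function names are all assumptions chosen for illustration.

```python
import math
import random


def log_posterior(theta, data, prior_sd=10.0, noise_sd=1.0):
    """Unnormalised log posterior for the mean of Gaussian data (toy model)."""
    log_prior = -0.5 * (theta / prior_sd) ** 2
    log_lik = sum(-0.5 * ((y - theta) / noise_sd) ** 2 for y in data)
    return log_prior + log_lik


def metropolis(data, n_iter=5000, step=0.5, seed=0):
    """Random-walk Metropolis sampler over a single parameter theta."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_posterior(theta, data)
    chain = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, p(proposal | y) / p(theta | y)).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        chain.append(theta)
    return chain


data = [2.1, 1.9, 2.4, 2.0, 1.6]          # made-up observations
chain = metropolis(data)
kept = chain[1000:]                        # discard early iterations as burn-in
posterior_mean = sum(kept) / len(kept)
```

After the early iterations are discarded, the retained values of `theta` behave like draws from $p(\theta \mid y)$, so their average approximates the posterior mean.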

2 Methods 

2.1 Data Pre-processing 

The original data set was obtained from the ADHIS in 1997. The data set
consisted of 49 text files containing both raw and summary data of milk production
in dairy herds around Australia. The subset of data used in this paper applies to
Holstein dairy cattle from the State of Victoria. Records were filtered to remove
those containing incomplete milk volume, fat and protein data and those that
lacked sire and dam information, and to retain only those records where
the number of test days per lactation was greater than seven. Exploratory data
techniques including Stepwise Linear Regression (SLR) and Principal Component
Analysis (PCA) were carried out on the resultant data set to determine
which variables were to be included in the final feature set. These analyses
indicated that some degree of non-linearity existed in the dam season of calving.
Dam second milk was used in place of dam first milk because of the large amounts
of missing data in the latter, and because of the high correlation between the
two (r = 0.81). The final feature set included dam second milk yield, sire ABV
for milk, dam herd mean milk yield excluding first lactation and dam season of
calving (autumn, summer, winter, spring). Age adjustments were carried out on
dam second milk following a previously published method. In the final data set,
dam season of calving was represented as a sparse-coded variable. For the NN, LR
and BLR methods, the continuously-valued variables (representing dam, sire and
herd information) were linearly transformed to values between zero and one. For
the BNN, the in-built tanh activation function of the software package used
required these variables to be transformed instead to values ranging between
-1 and +1.
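
The two linear transformations and the sparse coding of season can be illustrated with a short sketch. The helper names and the example yields are hypothetical, and min-max scaling is assumed as the form of the linear transformation, which the text does not spell out.

```python
SEASONS = ("autumn", "summer", "winter", "spring")


def min_max_scale(values, low=0.0, high=1.0):
    """Linearly map values onto [low, high] (assumes they are not all equal)."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    return [low + (high - low) * (v - v_min) / span for v in values]


def sparse_code(season):
    """One indicator per season, e.g. 'winter' -> [0, 0, 1, 0]."""
    return [1 if season == s else 0 for s in SEASONS]


milk = [4200.0, 5100.0, 6000.0]                 # made-up yields
unit_scaled = min_max_scale(milk)               # range (0, 1)
tanh_scaled = min_max_scale(milk, -1.0, 1.0)    # range (-1, +1) for tanh units
winter = sparse_code("winter")
```

Under this assumption the two scalings differ only by a factor of two and a shift, which is why errors reported on the (-1, +1) scale are twice those on the (0, 1) scale.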

The final data set contained 20682 data records in the training set, and a
further 5204 records in the test set. The record for the $n$th animal consisted of
seven input variables, $x_n = (x_{n1}, \ldots, x_{n7})$, with $x_{ni}$ representing
the following variables:

1. Dam herd mean milk yield excluding first lactation ($x_{n1}$)

2. Dam second milk yield ($x_{n2}$)

3. Sire ABV for milk yield ($x_{n3}$)

4-7. Dam season of calving (autumn ($x_{n4}$), summer ($x_{n5}$), winter ($x_{n6}$),
and spring ($x_{n7}$)).

Each record also contained the daughter milk yield for the first lactation 
(hereafter referred to as daughter first milk yield), the variable for prediction. 



2.2 Analytical Approaches 

All four approaches employed the same primary model, which represented daughter
first milk yield for the $n$th animal as $y_n$, $n = 1, \ldots, 20682$. In each
model, $y_n = \mu_n + \epsilon_n$, where $\mu_n$ represents the expected value of
$y_n$, hereafter referred to as $\hat{y}_n$. Here $\epsilon_n \sim N(0, \sigma)$
with $\sigma$ being the standard deviation of $y_n$. The full
training set of 20682 records was used in all cases, with the full test set of 5204
records used to verify the predictions obtained. For the BNN, an additional run
was carried out on a random sub-sample of 1000 records from the training set,
with the full test set used for predictions.

Bayesian Neural Network (BNN) The data were modelled using a multilayer
perceptron with seven input units, one for each data input, $x_{ni}$
($i = 1, \ldots, 7$), one hidden layer of $J$ tanh units ($J$ = 1, 2, 3, 4 or 8),
and a single output unit representing daughter first milk yield. The BNN model
was expressed as:

$$\hat{y}_n = b + \sum_{j=1}^{J} v_j h_j(x_n), \qquad h_j(x_n) = \tanh\Bigl(a_j + \sum_{i=1}^{7} u_{ij} x_{ni}\Bigr) \qquad (1)$$
Fig. 1. Graphical representation of the BNN model with seven input units, one
hidden layer of one, two, three, four or eight tanh units, and a single output unit 



In this model, $u_{ij}$ represented the weight on the connection from input unit
$i$ to hidden unit $j$, $v_j$ represented the weight on the connection from hidden
unit $j$ to the single output unit, and $a_j$ and $b$ represented the biases of the
hidden units and output unit respectively. The prior distributions used for the
network parameters ($u_{ij}, a_j, v_j$) were taken to be normally distributed,
$N(0, \omega)$, with standard deviations $\omega$ considered to be hyperparameters
with inverse Gamma, $IG(\alpha, \beta)$, distributions. Three network
hyperparameters with distributions $IG(0.05, 0.5)$ were specified, one for the
input-to-hidden weights, $u_{ij}$, one for the hidden unit biases, $a_j$, and one
for the hidden-to-output weights, $v_j$. The output unit bias $b$ was given a
simple Gaussian prior $N(0, 100)$. The value of the output unit was
taken as the mean of a Gaussian distribution for the target, with an associated
error term (or network "noise") having a Gaussian prior $N(0, \sigma)$. The standard
deviation of this error term was controlled using a hyperparameter with
distribution $IG(0.05, 0.5)$. Figure 1 provides a graphical representation of the
BNN. Here, circular and oval shapes depict variables (data or parameters),
double-edged boxes depict constants, dashed arrows define deterministic
relationships and solid arrows show stochastic relationships. For example, the
three dashed lines directed toward $\mu_n$ reflect the deterministic equation (1)
defining $\mu_n$, for which $\hat{y}_n$ is an estimate.
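
The deterministic part of the model, equation (1), amounts to a single forward pass through the network. The sketch below evaluates it for one record; the function name `bnn_forward` and all numerical values are illustrative assumptions, and no prior or sampling step is included.

```python
import math


def bnn_forward(x, u, a, v, b):
    """Equation (1): y_hat = b + sum_j v[j] * tanh(a[j] + sum_i u[i][j] * x[i])."""
    n_hidden = len(v)
    hidden = [
        math.tanh(a[j] + sum(u[i][j] * x[i] for i in range(len(x))))
        for j in range(n_hidden)
    ]
    return b + sum(v[j] * hidden[j] for j in range(n_hidden))


# Seven inputs: three scaled covariates followed by four season indicators.
x = [0.2, -0.1, 0.4, 0, 0, 1, 0]
J = 3                                   # number of hidden tanh units

# With all weights and biases at zero the output is zero.
u_zero = [[0.0] * J for _ in x]
y_init = bnn_forward(x, u_zero, [0.0] * J, [0.0] * J, 0.0)

# An arbitrary non-zero setting of the parameters.
u = [[0.1] * J for _ in x]
y_hat = bnn_forward(x, u, a=[0.0] * J, v=[1.0, 0.5, -0.5], b=0.3)
```

In the Bayesian treatment every entry of `u`, `a`, `v` and `b` is a random quantity, so a prediction is obtained by repeating this pass for each retained posterior sample.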

The model was implemented using MCMC methods to sample from the posterior
distribution. In the initial state of the simulation, each of the hyperparameters
(for the input-to-hidden weights, the hidden unit biases, the hidden-to-output
weights and the "noise") was given a value of 0.5. The network
parameters ($u_{ij}, a_j, v_j$) were given initial values of zero. Predictions were
based on the final 80 of 100 iterations, the first 20 being discarded as "burn-in".
The software used to implement the BNN, and the particulars of the implementation
used, are described by Neal.
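
One way to read "predictions based on the final 80 of 100 iterations" is that, for each test case, the draws from the retained iterations are summarised by their mean and by the 10% and 90% quantiles. The sketch below follows that reading; it is an assumption about the bookkeeping rather than a description of the software used, and the function names and draws are made up.

```python
def predictive_summary(draws, burn_in=20, lower=0.10, upper=0.90):
    """Mean and (lower, upper) empirical quantiles of the draws kept after burn-in."""
    kept = sorted(draws[burn_in:])
    mean = sum(kept) / len(kept)

    def quantile(q):
        # Simple rank-based empirical quantile.
        index = min(len(kept) - 1, max(0, int(q * len(kept))))
        return kept[index]

    return mean, quantile(lower), quantile(upper)


def coverage(targets, intervals):
    """Fraction of targets lying inside their (low, high) interval."""
    hits = sum(low <= t <= high for t, (low, high) in zip(targets, intervals))
    return hits / len(targets)


# 100 made-up draws for one test case; the first 20 are treated as burn-in.
draws = [0.9] * 20 + [i / 100 for i in range(80)]
mean, q10, q90 = predictive_summary(draws)
share_inside = coverage([0.5, 0.95], [(q10, q90), (q10, q90)])
```

A well-calibrated model would place roughly 80% of the test targets inside their 10% to 90% intervals.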

Artificial Neural Network (NN) The NN model was as described by equation
(1) except that the tanh function was replaced by the logistic function. The data
were modelled using a multilayer feed-forward NN. Five NN architectures with
a single hidden layer of $J$ units ($J$ = 1, 2, 3, 4 or 7) and logistic activation
functions were each trained ten times with different weight initialisations for
20,000 epochs, using a learning rate of 0.03 and zero momentum, and tested on the
test set every 50 epochs. The average over the ten weight initialisations is
reported for each network. The software employed for the NN was Tlearn.

Linear Regression (LR) In the LR model, $\hat{y}_n = \alpha + \sum_{i=1}^{7} \beta_i x_{ni}$,
with $\alpha$ a constant and the $\beta_i$ ($i = 1, \ldots, 7$) the coefficients of
the covariates of interest, as outlined in Section 2.1. The software used to fit
the LR was SPSS for Windows, Version 8.0.

Bayesian Linear Regression (BLR) The BLR model used is as described
for the LR, with priors expressed as $\alpha \sim N(0, 100)$,
$\beta_i \sim N(0, 100)$ and $\sigma^2$ distributed as $IG(0.001, 0.001)$. MCMC
methods were used to carry out the necessary numerical integrations using
simulation. A number of preliminary trial runs were carried out on the training
set from various starting values and differing numbers of "burn-in" iterations.
The final simulation was initiated with values of zero for $\alpha$ and the
$\beta_i$, and $\sigma^2$ equal to one. An initial run of 7000 iterations was
generated as "burn-in". Parameters of interest ($\alpha, \beta, \sigma^2$) were
then monitored and 15000 more iterations were performed, giving a total of 22000
iterations. A file containing a series of values simulated from the joint posterior
of the unknown quantities was used to monitor for convergence. Diagnostics,
including output analysis and summary statistics, indicated convergence had
occurred after 15000 iterations. The software employed was BUGS (Bayesian
Inference Using Gibbs Sampling), version 0.50.

3 Results 

The predictions of the different methods on test data were compared using cor- 
relations, mean error, root mean square error, and absolute error. The predicted 
value in all cases is daughter first milk yield but error values cited are for the 
linearly transformed data as explained in Section 2.1. 
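
For reference, the four comparison measures can be computed as in the sketch below. The sign convention for the error (prediction minus target) and the example values are assumptions, since the text does not state them.

```python
import math


def evaluate(targets, predictions):
    """Correlation, mean error, RMSE and mean absolute error."""
    n = len(targets)
    errors = [p - t for t, p in zip(targets, predictions)]
    mean_t = sum(targets) / n
    mean_p = sum(predictions) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(targets, predictions))
    var_t = sum((t - mean_t) ** 2 for t in targets)
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    return {
        "correlation": cov / math.sqrt(var_t * var_p),
        "mean_error": sum(errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "mean_abs_error": sum(abs(e) for e in errors) / n,
    }


# Made-up targets and predictions on the (0, 1) scale.
targets = [0.2, 0.4, 0.6, 0.8]
predictions = [0.3, 0.3, 0.7, 0.7]
scores = evaluate(targets, predictions)
```

Positive and negative errors cancel in the mean error, so it can be near zero while RMSE and mean absolute error remain sizeable; the example values show exactly this.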

3.1 Bayesian Neural Network (BNN) 

Table 1 shows the correlations between the target values of daughter first milk
yield and the values predicted by the BNN, mean error, RMSE, mean absolute
error, and the percentage of target values falling within the 80% prediction
quantile for network architectures of one, two, three, four and eight hidden
units. These results indicate that the architecture using four hidden units gives
a slightly higher correlation than that of the other architectures, although the
difference is not significant. Note that the errors cited in Table 1 are for a (-1,+1)
transformation (used with the BNN model), and as such are twice that which
would be expected using a (0,1) transformation (used with the NN, BLR and
LR models). Taking this factor into account, the errors referred to in Table 1
are essentially identical to the errors in Tables 2 and 3 (the results for the NN,
BLR and LR models below).

Table 1. Results for the BNN model. Note that the errors cited here are for
a (-1,+1) transformation and as such are twice that which would be expected
for a (0,1) transformation used with the NN, BLR and LR models (see Tables 2
and 3 below).

Number of hidden units               1         2         3         4         8
Correlation coefficient           0.7642    0.7645    0.7649    0.7650    0.7647
Mean error                       -0.0001    0.0000    0.0000    0.0000    0.0002
RMSE                              0.1216    0.1216    0.1212    0.1212    0.1212
Mean absolute error               0.0950    0.0949    0.0949    0.0949    0.0949
% targets in 10% to 90%
  prediction quantiles            82.80%    83.00%    82.74%    82.65%    82.80%
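
The factor of two can be checked with a short calculation. It assumes that the (0,1) value $x$ and the (-1,+1) value $x'$ of the same quantity are related by $x' = 2x - 1$, which is consistent with the linear transformations of Section 2.1 but is not stated explicitly there.

```latex
x' = 2x - 1 \;\Rightarrow\; \hat{y}' - y' = 2\,(\hat{y} - y),
\qquad
\tfrac{1}{2}\,\mathrm{RMSE}_{(-1,+1)} = \tfrac{1}{2}(0.1212) = 0.0606,
\qquad
\tfrac{1}{2}\,\mathrm{MAE}_{(-1,+1)} = \tfrac{1}{2}(0.0949) \approx 0.0475 .
```

The halved values can be set beside the RMSE of roughly 0.0605 to 0.0608 and the mean absolute error of 0.0475 to 0.0480 reported for the other three methods. The correlation coefficient is unaffected by a linear rescaling, so it needs no adjustment.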

In order to briefly investigate the effect of a smaller training set on the 
performance of the BNN, an additional training run using three hidden units 
was performed on a random unbiased sample of 1000 records from the training 
set, the results of which were then analysed on the full test set. The correlation 
between the target value of daughter first milk yield and the predicted value 
for this experiment was 0.7591 (cf. 0.7649 for the full training set) whilst the 
percentage of target values falling within the 80% prediction quantile was 82.51% 
(cf. 82.74% for the full training set). 



3.2 Artificial Neural Networks (NN) 

Table 2 shows the correlations between the target value of daughter first milk 
yield and the value predicted by the NN for network architectures of one, two, 
three, four and seven hidden units, as well as the error components for the 
predictions. The correlation coefficient for the three hidden units is slightly better 
than that for the other architectures, although this difference is not significant. 



3.3 Linear Regression (LR) 

All seven covariates of interest were employed for LR. However, $x_{n7}$,
representing dam season of calving (spring), was eliminated by SPSS due to the
dependency generated in the data from the binary encoding of the seasonal effects.
The final regression equation was:
Table 2. Correlations between the target values of daughter first milk yield
and the predicted values, the mean of the error, RMSE, and the mean absolute
error for NN.

Number of hidden units         1         2         3         4         7
Correlation                 0.7641    0.7645    0.7646    0.7645    0.7645
Mean error                  0.0060    0.0060    0.0040    0.0050    0.0050
RMSE                        0.0607    0.0606    0.0605    0.0606    0.0605
Mean absolute error         0.0480    0.0480    0.0480    0.0480    0.0480


$$\hat{y}_n = -0.0206 + 0.4390 x_{n1} + 0.2170 x_{n2} + 0.0972 x_{n3} + 0.0125 x_{n4} + 0.0149 x_{n5} + 0.0011 x_{n6} \qquad (2)$$
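
The dependency that led to the spring indicator being dropped can be seen directly: the four season indicators always sum to one, which duplicates the constant term. The sketch below checks this with a small rank computation; the `rank` helper and the toy design matrices are illustrative and unrelated to how SPSS detects the problem.

```python
def rank(matrix, tol=1e-9):
    """Rank of a small matrix by Gaussian elimination with partial pivoting."""
    rows = [list(map(float, r)) for r in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    r = 0
    for c in range(n_cols):
        pivot = max(range(r, n_rows), key=lambda k: abs(rows[k][c]))
        if abs(rows[pivot][c]) < tol:
            continue                      # no usable pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for k in range(r + 1, n_rows):
            factor = rows[k][c] / rows[r][c]
            rows[k] = [x - factor * y for x, y in zip(rows[k], rows[r])]
        r += 1
        if r == n_rows:
            break
    return r


# Columns: constant, autumn, summer, winter, spring (one row per season).
full = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1],
]
without_spring = [row[:4] for row in full]

rank_full = rank(full)                    # fewer than the 5 columns
rank_reduced = rank(without_spring)       # equals the 4 remaining columns
```

With the spring column removed, the remaining columns are linearly independent, and spring becomes the reference season absorbed into the constant.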

3.4 Bayesian Linear Regression (BLR) 

All seven covariates of interest were employed for BLR. The resultant regression 
equation was: 



$$\hat{y}_n = -0.0159 + 0.4397 x_{n1} + 0.2171 x_{n2} + 0.0973 x_{n3} + 0.0063 x_{n4} + 0.0089 x_{n5} - 0.0050 x_{n6} - 0.0059 x_{n7} \qquad (3)$$

Both types of linear regressions are compared in Table 3. Once again, the
two methods gave statistically similar correlation coefficients. Note that both
Bayesian and frequentist linear regressions arrived at predictive equations (2)
and (3) with similar constants and coefficients for the dam, sire and herd effects.
Some degree of variation occurred in the coefficients of the seasonal effects,
which can be explained by the omission of the variable $x_{n7}$, dam season of
calving (spring), by the LR. The BLR, having been supplied with all variables, made
adjustments in the coefficients of the other seasonal variables to allow for the
inclusion of $x_{n7}$. It is also of interest to note that despite the other
indicators being equivalent, there was some increase in the mean error of the BLR
compared with LR. In the MCMC simulation it was observed that the main parameters
of interest ($x_1, x_2, x_3$, corresponding to herd, dam and sire effects) were very
stable, whilst the seasonal parameters ($x_4$ to $x_7$) were inclined to wander, with
the constant $\alpha$ also inclined to wander, presumably to counteract the effect of
the seasonal digressions. This may have occurred due to the over-specification
of the BLR model with the inclusion of $x_7$.

4 Discussion 

Four main conclusions arise from the studies carried out in this paper. Firstly,
all four approaches are equivalent in terms of their predictive accuracy on the
dairy data, as indicated by a test of correlation coefficients between target and
predicted value of daughter first milk yield.

Table 3. Correlations between the target values of daughter first milk yield
and the predicted values, the mean of the error, RMSE, and the mean absolute
error for LR and BLR.

         Correlation    Mean Error    RMSE      Mean Abs Error
LR       0.7642         0.0001        0.0608    0.0475
BLR      0.7642         0.0002        0.0608    0.0475

Secondly, the similarity of the results for all four methods suggests that
there is little or no non-linearity in the dairy data. The initial data exploration
using SLR and PCA as outlined in Section 2.1 indicated the presence of a possibly
non-linear seasonal effect in the data. Some non-linearity in the data may have
been removed during the pre-processing stage with the age adjustment for dam
second milk yield. The small amount of remaining non-linearity did not impact
greatly on the results, which were not significantly different across the four
methods.

Thirdly, the predictive power of the BNN trained on the random subset of 
1000 records was not significantly different from that of the network trained 
on the full training set, as indicated by a comparison of correlation coefficients. 
Moreover, the reduction in training time for the network was dramatic. This 
finding is significant, not only in terms of clock time necessary to train the 
network, but also in terms of the amount of data necessary for training. This 
raises the question of optimal required training set size for both NN and BNN, 
and further investigation is necessary to determine this. 

Fourthly, it is apparent from this study that each different method provided
its own particular advantages. On the one hand, both linear regression methods,
LR and BLR, provide straightforward descriptions of the relationship being
portrayed, as well as some quantification of the importance of each input
attribute, by way of the respective predictive equations (2) and (3). On the other
hand, both neural network approaches, BNN and NN, have indicated a possible
non-linearity in the data by the slight preference for three or four hidden units
seen in Tables 1 and 2. Similarly, an advantage of both Bayesian approaches is
some quantification of uncertainty in prediction, with the BNN predictions
expressed in terms of probabilities with credible intervals calculated, plus the
added insight gained by the depiction of the Bayesian model using a graphical
representation (see Figure 1). Conversely, an advantage of the non-Bayesian
approaches, LR and NN, is their relative ease of implementation using readily
available software, as opposed to the emerging techniques of both Bayesian
approaches, whose implementations were made more difficult by software packages
in the early stages of development and refinement.

This study has built on the work of previous researchers in exploring
the potential advantages of BNN over traditional NN learning. It is anticipated
that a more complete investigation will be undertaken to assess the effect of
decreasing the amount of training data on the predictive capability of NN and BNN,
and the resulting decrease in the training time necessary. A further investigation
of the other potential advantages of BNN, including the model fitting aspect,
the accommodation of missing values, and the avoidance of over-fitting when
using a large network, is also foreseen.
Abstract. We present KD-DT, an algorithm that uses a decision-tree-inspired
measure to build a kd-tree for low cost nearest-neighbor searches.
The algorithm starts with a "standard" kd-tree and uses searches over
a training set to evaluate and improve the structure of the kd-tree. In
particular, the algorithm builds a tree that better ensures that a query
and its nearest neighbors will be in the same subtree(s), thus reducing
the cost of subsequent search.



1 Introduction 

Kd-trees [?, ?, ?] support efficient nearest-neighbor searches for tasks such
as instance-based reasoning (e.g., [?, ?, ?]).

KD-DT uses a decision-tree-inspired measure to build a kd-tree with lower-cost
nearest-neighbor searches than the “standard” approach of [?].

The algorithm starts with a “standard” kd-tree and uses searches over a
training set to evaluate and improve the structure of the kd-tree. We test our
approach in a nutrition database of 11,697 food instances, each instance
described by 56 continuously-valued nutritional components (e.g., protein
content and fat content) [?, ?].

Section 2 provides background information on kd-trees. Section 3 presents 
the algorithm with the decision-tree-inspired measure, and Section 4 discusses 
our experiments with this algorithm. Finally, Sections 5 and 6 describe related 
and future work, respectively. 



2 Kd-Trees

Binary kd-trees organize k-dimensional data for efficient search of nearest neigh- 
bors. Internal nodes split dimensions using a threshold, thus partitioning the 
data, and each leaf lists the instances that satisfy the conditions implied by the 
path to the leaf. 



D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 407-[?], 1999.
© Springer-Verlag Berlin Heidelberg 1999




Fig. 1. A Simple Kd-Tree 



Figure 1 shows a simple kd-tree on the following data points: (-10, 5) (-5,3) 
(0,1) (3,2) (4,12) (8,14). 

Inputs to a kd-tree search are a “query” instance, a neighborhood size, and a 
kd-tree. For example, using the tree in Figure 1, we might request the 2 nearest 
neighbors of (-7, 4). 

Two lists help in the search. One list keeps track of nearest neighbors seen 
thus far along with their distances from the query. The other list keeps track of 
the geometric boundaries of the current node. These boundaries are defined by 
the split point values at all of the node’s ancestors. For the tree in Figure 1, the
bounds on the root, node A, are $(-\infty < att_1 < \infty,\ -\infty < att_2 < \infty)$ and the
bounds for node F are $(0 < att_1 < \infty,\ -\infty < att_2 < 12)$. All data points that
lie within the bounds of a node are in the subtree of that node.

The list of nearest neighbors is initialized to contain the necessary number
of instance slots with all the associated distances set to $\infty$.

The query instance is passed to a node. If that node is not a leaf, the query 
is compared to the node’s split point and the appropriate child is identified. The 
boundaries list is updated, and the recursive call to the child is made. If the 
node is a leaf, the algorithm computes the distance of each datum at the leaf to 
the query instance and compares this distance to the farthest distance in the list 
of neighbors. The list of closest neighbors thus far is updated whenever a closer 
element is found. 

Our example begins with: 

Initial boundaries list: $(-\infty < att_1 < \infty,\ -\infty < att_2 < \infty)$

Initial neighbor list: (null, $\infty$; null, $\infty$)

The query starts at the root, passes through node B, and reaches a leaf at 
node D. After searching D’s data, the nearest neighbor list is ((-10, 5), 3.16; 
(-5, 3), 2.24). 

After a node has been completely searched, the algorithm uses the boundary 
information to decide if it is done searching. If all the boundaries are at least as
far away from the query as the farthest near neighbor, the algorithm can stop
because every point outside that boundary is too far away to be of interest, and
every point inside that boundary has already been examined.

In our example, the farthest neighbor is 3.16 units away, but according to the
current boundary conditions, $(-\infty < att_1 < -5,\ -\infty < att_2 < \infty)$, there is a
boundary only 2 units from our query.

Thus, the algorithm backtracks to the previous node and checks to see if the 
other child needs to be searched. The boundaries of the unexamined child are 
compared to the distance, d, to the most distant near neighbor. If the space 
defined by the bounds of that node intersects the region of space within d units 
of the query, the node needs to be searched because there could be a closer value 
within that node’s subtree. 

In our example, the boundaries for the unexplored child, node E, are
$(-5 < att_1 < 0,\ -\infty < att_2 < \infty)$. Our farthest neighbor (3.16 units away)
is farther from the query than the boundary at $att_1 = -5$ (2 units away).
Therefore, we search the node, but do not find any items close
enough to update the neighbor list. E’s bounds do not completely enclose the
area containing possible near neighbors. So, we continue our search.

Since all of B’s children have now been searched, the check to see if we
can quit is performed again. The current boundaries are
$(-\infty < att_1 < 0,\ -\infty < att_2 < \infty)$. All of those boundaries are more
than 3.16 units from our query. Thus, our search stops and returns the current
list of neighbors.

The fundamental operation during a kd-tree search is checking the distance 
from the query instance to a point along a single dimension which happens at 
varying times, during top-down traversal at each split point, at the leaves, and 
during backtracking. 
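
The walk-through above can be replayed with a short Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the tree of Figure 1 is written out by hand as nested tuples, points with a value less than or equal to the split go left, and the names `TREE`, `box_distance`, `search` and `nearest_neighbors` are hypothetical. A subtree is visited only when its bounding box lies closer to the query than the current farthest neighbor, which is the boundary test described in the text.

```python
import math

# Hand-written version of the tree in Figure 1 (an assumed encoding).
# Internal node: (dimension, split value, left child, right child).
# Leaf: list of points. Points with value <= split go left.
TREE = (0, 0,
        (0, -5, [(-10, 5), (-5, 3)], [(0, 1)]),   # node B with leaves D and E
        (1, 12, [(3, 2), (4, 12)], [(8, 14)]))    # right child of the root


def box_distance(query, lo, hi):
    """Distance from the query to the axis-aligned box with corners lo and hi."""
    return math.sqrt(sum(max(l - q, 0, q - h) ** 2
                         for q, l, h in zip(query, lo, hi)))


def search(node, query, best, lo, hi):
    """Recursive search; `best` is kept sorted by distance, farthest last."""
    if isinstance(node, list):                     # leaf: test every datum
        for p in node:
            d = math.dist(p, query)
            if d < best[-1][0]:
                best[-1] = (d, p)
                best.sort(key=lambda pair: pair[0])
        return
    dim, split, left, right = node
    left_box = (lo, hi[:dim] + [split] + hi[dim + 1:])
    right_box = (lo[:dim] + [split] + lo[dim + 1:], hi)
    children = [(left, left_box), (right, right_box)]
    if query[dim] > split:                         # visit the near child first
        children.reverse()
    for child, (child_lo, child_hi) in children:
        # Skip a subtree whose bounds are farther than the farthest neighbor.
        if box_distance(query, child_lo, child_hi) < best[-1][0]:
            search(child, query, best, child_lo, child_hi)


def nearest_neighbors(tree, query, k):
    """Return k (distance, point) pairs, nearest first."""
    best = [(math.inf, None)] * k                  # empty slots at distance inf
    n = len(query)
    search(tree, query, best, [-math.inf] * n, [math.inf] * n)
    return best


result = nearest_neighbors(TREE, (-7, 4), 2)
```

For the query (-7, 4) the sketch visits leaves D and E and prunes the right half of the tree, returning (-5, 3) and (-10, 5) at distances of about 2.24 and 3.16, as in the example.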

A kd-tree learning algorithm [?] recursively partitions the data by the median
of the dimension with the greatest range. The “longest” dimension is always split 
in an attempt to keep each node’s geometric region as compact as possible. 

The user specifies the number of elements allowed in a leaf node. The algo- 
rithm uses this threshold to decide when to stop partitioning a node. 

The tree in Figure 1 was constructed in this manner over (-10,5) (-5,3) (0,1)
(3,2) (4,12) (8,14). 

Attribute 1, with a range of [-10, 8], has a broader spread than attribute 2. 
Thus, its median value, 0, is chosen as the first split point. The instances are 
then partitioned, and the process repeats until leaves smaller than the specified 
threshold (e.g., 2) are formed. 
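
As a minimal sketch of this construction, the following Python function splits the widest dimension at its median until a node holds no more than the leaf threshold. Two details are assumptions chosen so that the example reproduces the split value 0 quoted above: on an even number of values the lower median is used, and values equal to the split go to the left child. The name `build_kdtree` and the tuple encoding of nodes are hypothetical.

```python
def build_kdtree(points, leaf_size=2):
    """Split the dimension with the greatest range at its (lower) median.

    Returns a leaf (list of points) or an internal node
    (dimension, split value, left subtree, right subtree).
    """
    if len(points) <= leaf_size:
        return list(points)
    dims = range(len(points[0]))
    # Pick the "longest" dimension: the one with the broadest range of values.
    dim = max(dims, key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    values = sorted(p[dim] for p in points)
    split = values[(len(values) - 1) // 2]         # lower median (assumption)
    left = [p for p in points if p[dim] <= split]
    right = [p for p in points if p[dim] > split]
    if not right:                                  # all values tie: stop here
        return list(points)
    return (dim, split,
            build_kdtree(left, leaf_size),
            build_kdtree(right, leaf_size))


data = [(-10, 5), (-5, 3), (0, 1), (3, 2), (4, 12), (8, 14)]
tree = build_kdtree(data, leaf_size=2)
```

On the six example points the root splits attribute 1 at 0, its left child splits attribute 1 at -5, and its right child splits attribute 2 at 12, which agrees with the node bounds quoted for Figure 1.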



3 KD-DT 

For each kd-tree node, the “standard” algorithm examines each attribute in 
the search space and selects the median of the attribute with the broadest range 
to be the split point. This produces a balanced tree with informative splits that 
allow efficient searches. Ideally, the search would be able to proceed directly to 
the leaf containing all the nearest neighbors and complete its search without any 
backtracking. 
Backtracking occurs when the nearest neighbors of the search item are
separated by one of the split points in the kd-tree. KD-DT uses a
decision-tree-inspired measure to more directly measure the likelihood of
backtracking given a particular split point.

Given an initial “standard” kd-tree, KD-DT identifies the nearest neighbors 
for every item in a training data set. It then builds a new tree, starting at the 
root, by examining all possible split points for each attribute at the current node. 
For each side of the split point, KD-DT measures the increase in likelihood that 
a training item’s nearest neighbors lie on that side of the split point given that 
the item itself lies on that side of the split point over the likelihood that the 
nearest neighbors lie on that side of the split point without that condition. KD- 
DT then computes a weighted average of the scores for both sides and weights 
that score by the probability that the attribute’s value is observed in the query. 

The potential split point with the greatest score is selected. The training data 
are partitioned accordingly, and this process repeats at the child nodes until the
number of food instances at a node falls below a user-specified threshold.

KD-DT’s measure is:

$$
\mathrm{Score}_{SP_i} = P(\text{a training instance has a value for } A_i) \times
\Big[ P(T_{ji} \le SP_i)\,\big[ P(NN_{ji} \le SP_i \mid T_{ji} \le SP_i)^2 - P(NN_{ji} \le SP_i)^2 \big]
    + P(T_{ji} > SP_i)\,\big[ P(NN_{ji} > SP_i \mid T_{ji} > SP_i)^2 - P(NN_{ji} > SP_i)^2 \big] \Big],
$$

where

— $A_i$ is the attribute currently under consideration,

— $SP_i$ is the split point along $A_i$ currently under consideration,

— $T_{ji}$ is the value of attribute $A_i$ for the training (query) instance,

— $NN_{ji}$ is the value of attribute $A_i$ for a nearest neighbor of the training
(query) instance,

— $P(\text{a training instance has a value for } A_i)$ is the probability that a training
(query) instance has a value for this attribute,

— $P(T_{ji} \le SP_i)$ is the probability that the value of the attribute for a
training instance is less than or equal to the split point currently under
consideration,

— $P(NN_{ji} \le SP_i)$ is the probability that the value of the $i$-th attribute for a
nearest neighbor is less than or equal to the value of the considered split
point,

— $P(NN_{ji} \le SP_i \mid T_{ji} \le SP_i)$ is the probability that the value of the $i$-th
attribute for a nearest neighbor of a training (query) instance is less than or
equal to the value of the split point currently under consideration given that
the value of the $i$-th attribute for the training instance is less than or equal
to the value of the considered split point.

Inspired by the Gini-index used for building decision trees, this measure
collectively assesses the extent to which a query instance and its nearest
neighbors will lie on the same side of a split point. The split point that
maximizes the measure is chosen to represent the attribute. The attribute with
the maximum score is chosen to split the data.
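
The measure can be sketched in Python for a single attribute and a single candidate split point. This is an illustrative reading of the formula, not the authors' code: every probability is estimated as a relative frequency over the training pairs, the conditional terms are squared as in the Gini index, neighbor values are assumed to be present, and the names `split_score` and `pairs` are hypothetical.

```python
def split_score(pairs, sp):
    """Score candidate split point `sp` on one attribute.

    pairs: list of (t, nn_values), where t is the training instance's value
    for the attribute (None when missing) and nn_values holds the values of
    the same attribute for that instance's nearest neighbors.
    """
    present = [(t, nns) for t, nns in pairs if t is not None]
    if not present:
        return 0.0
    p_has_value = len(present) / len(pairs)

    def frac(values, pred):
        """Relative frequency of pred over values (0.0 for an empty list)."""
        values = list(values)
        return sum(1 for v in values if pred(v)) / len(values) if values else 0.0

    all_nn = [v for _, nns in present for v in nns]
    score = 0.0
    # One term for each side of the split point: "<= sp" and "> sp".
    for on_side in (lambda v: v <= sp, lambda v: v > sp):
        p_t = frac((t for t, _ in present), on_side)
        nn_given = [v for t, nns in present if on_side(t) for v in nns]
        p_nn_given = frac(nn_given, on_side)       # P(NN on side | T on side)
        p_nn = frac(all_nn, on_side)               # P(NN on side)
        score += p_t * (p_nn_given ** 2 - p_nn ** 2)
    return p_has_value * score


# Two clusters: a split between them keeps every instance with its neighbors.
pairs = [(1, [1, 2]), (2, [1, 1]), (8, [9, 9]), (9, [8, 8])]
between = split_score(pairs, 5)
inside = split_score(pairs, 1)
```

A split point that falls between the two clusters scores higher than one that cuts through a cluster, which is the behavior the measure is meant to reward.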

Finally, KD-DT handles missing values differently from the “standard”
algorithm. When a food instance is missing a value needed to compute its
distance from a query, KD-DT uses the (global) mean value of that dimension as
an estimate, and during tree construction, when a food instance is missing a
value needed to decide which subtree the instance belongs in, KD-DT puts the
food instance in both subtrees. Since KD-DT estimates missing values during
search, this ensures that no potential neighbor is missed during the search.
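
Both rules can be sketched briefly in Python. The encoding of a missing value as `None` and the function names are assumptions made for illustration, and every dimension is assumed to have at least one known value.

```python
import math


def global_means(instances):
    """Mean of the known values in every dimension (None marks a missing value)."""
    means = []
    for d in range(len(instances[0])):
        known = [x[d] for x in instances if x[d] is not None]
        means.append(sum(known) / len(known))
    return means


def distance_with_means(query, instance, means):
    """Euclidean distance, estimating a missing value by the global mean."""
    total = 0.0
    for q, v, m in zip(query, instance, means):
        v = m if v is None else v
        total += (q - v) ** 2
    return math.sqrt(total)


def partition(instances, dim, split):
    """Split at a node; an instance missing the split attribute goes to both sides."""
    left = [x for x in instances if x[dim] is None or x[dim] <= split]
    right = [x for x in instances if x[dim] is None or x[dim] > split]
    return left, right


foods = [(1.0, 2.0), (3.0, None), (5.0, 6.0)]
means = global_means(foods)
left, right = partition(foods, 1, 4.0)
```

Here the second instance lacks its second attribute, so it is placed in both halves of the partition and its distance to a query is computed with the mean 4.0 in that dimension.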

4 Experiments 

To test KD-DT, we ordered our food instances alphabetically and split them
into 10 equal subsets. For each subset, we performed the following 5-fold
cross-validation test. We randomly split the subset into five equal groups. For
each combination of four groups, we built a standard kd-tree and measured
its average search cost for finding the 10 nearest neighbors of each item in the
fifth (test) group. We used a maximum leaf size of 10 food instances. Then we
searched the tree for the 10 nearest neighbors of each food instance in the four
(training) groups used to build the tree, and recorded them. Once this was
completed, we had collected a set of ordered pairs: each training instance
together with its 10 nearest neighbors. We built a new kd-tree using the measure
from Section 3 with this (ordered-pair) training data and repeated the search for
the 10 nearest neighbors of each item in the fifth (test) group using the new
kd-tree.
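
The partitioning step of this protocol can be sketched as follows. The sketch only generates the training and test groups; building and searching the trees is left out. The handling of leftover items when the data do not divide evenly (they are dropped here) and the fixed random seed are assumptions, and `make_folds` is a hypothetical name.

```python
import random


def make_folds(names, n_subsets=10, n_folds=5, seed=0):
    """Yield (subset index, fold index, training items, test items)."""
    ordered = sorted(names)                        # alphabetical order
    size = len(ordered) // n_subsets               # leftovers are dropped (assumption)
    rng = random.Random(seed)
    for s in range(n_subsets):
        subset = ordered[s * size:(s + 1) * size]
        rng.shuffle(subset)                        # random split into groups
        group = len(subset) // n_folds
        groups = [subset[g * group:(g + 1) * group] for g in range(n_folds)]
        for f in range(n_folds):
            test = groups[f]
            train = [x for g in range(n_folds) if g != f for x in groups[g]]
            yield s, f, train, test


names = [f"food{i:03d}" for i in range(100)]
folds = list(make_folds(names))
```

With 10 subsets and 5 folds each, the generator yields 50 train/test pairs, one per tree comparison.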

Thus each experimental fold compared the cost of finding nearest neighbors
using the standard tree over a test set with the cost of finding nearest neighbors
using an alternative tree constructed as described in Section 3. Cost was
measured in terms of the number of single-dimension distance calculations, which, as
mentioned in Section 2, is the fundamental operation of kd-tree search.

Table 1 shows the individual results for these experiments, and Table 2 shows
the average results over all 10 5-fold cross validations.

On average, the nearest-neighbor searches in the KD-DT trees were 48%
“cheaper” than in the standard kd-trees. Typical cost reduction was around
40-45%. Set 5 showed the least improvement (26%), and set 10 showed the best
improvement (64%).
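
The quoted percentages follow from the cost figures in Tables 1 and 2 as (default - trained) / default. The short check below copies the first value of each table cell (the parenthesized values are not used); the names `TABLE_1`, `TABLE_2` and `reduction` are hypothetical.

```python
# (default-tree cost, trained-tree cost) per set, copied from Table 1.
TABLE_1 = {
    1: (14479.62, 9204.44),
    2: (13727.56, 7704.59),
    3: (14542.01, 9534.22),
    4: (11432.87, 6997.28),
    5: (17335.92, 12854.28),
    6: (33788.87, 14270.89),
    7: (13661.31, 7583.69),
    8: (13580.47, 8582.13),
    9: (32000.21, 16199.83),
    10: (42690.15, 15410.24),
}
TABLE_2 = (20723.90, 10834.16)                     # averages from Table 2


def reduction(default, trained):
    """Relative cost reduction of the trained tree, in percent."""
    return 100.0 * (default - trained) / default


per_set = {s: reduction(d, t) for s, (d, t) in TABLE_1.items()}
overall = reduction(*TABLE_2)
mean_default = sum(d for d, _ in TABLE_1.values()) / len(TABLE_1)
mean_trained = sum(t for _, t in TABLE_1.values()) / len(TABLE_1)
```

The averages of the Table 1 columns reproduce the Table 2 entries, and the rounded reductions for the summary, set 5 and set 10 are 48%, 26% and 64%.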



5 Related Work 

KD-DT uses more information about the distribution of the search space and its 
nearest neighbors to build a kd-tree than the standard approach of [?]. It may,
however, be the case that user queries exhibit an entirely different distribution 
than the data (e.g., USDA database) itself. For example, user queries might fo- 
cus on a small subset of the attributes rather than the entire set. In our nutrition 
domain, this could occur when a user is focused on a specific nutritional concern. 
Table 1. Individual 5-fold cross validation tests

          Default tree             “Trained” tree
Set 1     14479.62 (7599.74)        9204.44 (6595.23)
Set 2     13727.56 (8659.75)        7704.59 (5691.19)
Set 3     14542.01 (8067.75)        9534.22 (6341.54)
Set 4     11432.87 (7586.69)        6997.28 (6015.32)
Set 5     17335.92 (7751.10)       12854.28 (6111.17)
Set 6     33788.87 (22038.56)      14270.89 (9288.01)
Set 7     13661.31 (9785.98)        7583.69 (6369.20)
Set 8     13580.47 (8657.28)        8582.13 (6979.09)
Set 9     32000.21 (17745.07)      16199.83 (8230.35)
Set 10    42690.15 (36097.41)      15410.24 (12659.85)


Table 2. Summary of experiments

Default tree             Trained tree
20723.90 (13398.93)      10834.16 (7428.10)




For example, a diabetic might only be interested in foods with high protein and
low carbohydrates. In [?], we present OPT-KD, an algorithm that exploits
information about (simulated-user) query distributions to improve kd-tree search
efficiency under such conditions, by excluding those attributes that are not
commonly part of user queries. Thus, [?] reexpresses “the data” subsequent to new
tree construction, but does not otherwise exploit query distribution information.

There are also other kd-tree algorithms that use different split point
selection techniques. [?] uses the mean of the “broadest” attribute as the split
point rather than the median, and whereas [?] and [?] partition the data at a node
using a plane orthogonal to an attribute, [?] uses matrix computations to
select a non-orthogonal partition plane. VP-trees [?] decompose the search space
spherically instead of rectangularly as kd-trees do. Another nearest-neighbor
approach from computational geometry is Voronoi diagrams [?, ?].

Machine learning has also been applied to diet management in the
CAMPER system [?]. This system combines case-based reasoning techniques
with a rule-based reasoning technique to use information about an individual’s 
nutritional needs and personal preferences to suggest a daily menu for that 
person. 

6 Future Work 

We are very interested in the contrast between the data over which a kd-tree is 
constructed (e.g., records from the USDA database) and the queries of this data 
by a particular user or a population of users (e.g., find the foods that are nearest
neighbors of a high-protein, low-fat food such as tuna). These queries may themselves serve
as data, but with very different distributional properties than the data to which 
they are being applied. Split measures like the one that we have introduced 
in this paper can be applied in conjunction with either kind of data set. Future 
work will investigate ways of exploiting the distribution of both populations, data 
and user queries, in building customized kd-trees that reduce nearest-neighbor 
searches for a particular user or group of users. 

Since we are using the search space itself as the training set, a natural
extension to our current work would be an incremental kd-tree construction
algorithm that incorporates queried items into the search space while learning
how to organize the tree.

We are examining the effects of training on a subset of the search space 
rather than on the entire data set as well as the effects of searching for different 
numbers of neighbors during the training phase. 

The relationship of our work to other nearest-neighbor techniques and to 
applications that rely on nearest-neighbor search (e.g., instance-based reasoning) 
needs to be examined as well. 

Finally, our handling of missing values is simplistic, and other techniques 
need to be examined. We are exploring other value estimation techniques, and 
we are considering adopting Aha’s IGNORE technique [?]. We also are exploring
methods to quantify the utility of these techniques. 
Abstract. For good classification, preprocessing is a key step. Good
preprocessing reduces the noise in the data and retains most of the information
needed for classification. Poor preprocessing, on the other hand, can make
classification almost impossible. In this paper we evaluate several feature
extraction methods in a special type of outlier detection problem, machine
fault detection. We will consider measurements on water pumps
under both normal and abnormal conditions. We use a novel data
description method, called the Support Vector Data Description, to get an
indication of the complexity of the normal class in this data set and how
well it is expected to be distinguishable from the abnormal data.



1 Introduction 

For good classification the preprocessing of the data is an important step. Good
preprocessing reduces the noise in the data and retains as much of the
information as possible [?]. When the number of objects in the training set is too small
for the number of features used (the feature space is undersampled), most
classification procedures cannot find good classification boundaries. This is called
the curse of dimensionality (see [?] for an extended explanation). By good
preprocessing the number of features per object can be reduced such that the
classification problem can be solved.

A particular type of preprocessing is feature selection. In feature selection
one tries to find the optimal feature set from an already given set of features [?].
In general this set is very large. To compare different feature sets, a criterion has
to be defined. Often very simple criteria are used for judging the quality of the
feature set or the difficulty of the data set. See [2] for a list of different measures.

Sometimes we encounter a special type of classification problem, so-called
outlier detection or data domain description problems. In data domain
description the goal is to accurately describe one class of objects, the target class,
as opposed to a wide range of other objects which are not of interest or are
considered outliers [?]. This last class is therefore called the outlier class. Many
standard pattern recognition methods are not well equipped to handle this type
of problem; they require complete descriptions for both classes. Especially when
the outlier class is very diverse and ill-sampled, normal (two-class) classifiers
generalize very badly for this class.

Fig. 1. Data description of a small data set: (left) normal spherical description,
(right) description using a Gaussian kernel.

In this paper we will introduce a new method for data domain description,
the Support Vector Data Description (SVDD). This method is inspired by the
Support Vector Classifier of V. Vapnik [?] and it defines a spherically shaped
boundary with minimal volume around the target data set. Under some
restrictions, the spherically shaped data description can be made more flexible by
replacing normal inner products by kernel functions. This will be explained
in more detail in section 2.

In this paper we try to find the best representation of a data set such that the
target class is optimally clustered and can be distinguished as well as possible
from the outlier class. The data set considered here is vibration data recorded
from a water pump. The target class contains recordings of the normal
behaviour of the pump, while erroneous behaviour is placed in the outlier class.
Different preprocessing methods will be applied to the recorded signals in order
to find the optimal set of features.

We will start with an explanation of the Support Vector Data Description in
section 2. In section 3 the origins of the vibration data will be explained and in
section 4 we will discuss the different types of features extracted from this data
set. In section 5 the results of the experiments are shown and we will conclude
in section 6.

2 Support Vector Data Description 

The Support Vector Data Description (SVDD) is the method which we will use
to describe our data. It is inspired by the Support Vector Classifier of V. Vapnik
([?], or for a simpler introduction [?]). The SVDD is explained in more detail
in [5]; here we will just give a quick impression of the method.



Pump Failure Detection Using Support Vector Data Descriptions 417 



The idea of the method is to find the sphere with minimal volume which
contains all data. Assume we have a data set containing $N$ data objects,
$\{x_i,\ i = 1, \ldots, N\}$, and the sphere is described by center $a$ and radius $R$.
We now try to minimize an error function containing the volume of the sphere.
The constraints that objects are within the sphere are imposed by applying
Lagrange multipliers:

$$L(R, a, \alpha_i) = R^2 - \sum_i \alpha_i \{ R^2 - (x_i^2 - 2 a \cdot x_i + a^2) \} \qquad (1)$$

with Lagrange multipliers $\alpha_i \ge 0$. This function has to be minimized with respect
to $R$ and $a$ and maximized with respect to $\alpha_i$.

Setting the partial derivatives of $L$ with respect to $R$ and $a$ to zero gives:

$$\sum_i \alpha_i = 1 \quad \text{and} \quad a = \sum_i \alpha_i x_i \qquad (2)$$

This shows that the center of the sphere $a$ is a linear combination of the data
objects $x_i$.

Resubstituting these values in the Lagrangian gives the function to maximize
with respect to $\alpha_i$:

$$L = \sum_i \alpha_i (x_i \cdot x_i) - \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j) \qquad (3)$$

with $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$.

This function should be maximized with respect to $\alpha_i$. In practice this means
that a large fraction of the $\alpha_i$ become zero. For a small fraction $\alpha_i > 0$, and these
objects are called Support Objects. We see that the center of the sphere depends
just on the few support objects; objects with $\alpha_i = 0$ can be disregarded.

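Object $z$ is accepted when:

$$(z - a)(z - a)^T = \Big(z - \sum_i \alpha_i x_i\Big)\Big(z - \sum_i \alpha_i x_i\Big)^T
 = (z \cdot z) - 2 \sum_i \alpha_i (z \cdot x_i) + \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j) \le R^2 \qquad (4)$$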
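
To make the optimization concrete, the dual problem (3) can be solved approximately with a few lines of Python. The source does not say which optimizer is used; the sketch below substitutes a Frank-Wolfe iteration with exact line search, which repeatedly shifts weight towards the object farthest from the current center. The function names, the toy data and the iteration count are assumptions, and the multipliers of non-support objects only decay towards zero rather than reaching it exactly.

```python
import math


def dot(x, y):
    """Plain inner product (x . y)."""
    return sum(a * b for a, b in zip(x, y))


def fit_svdd(xs, kernel=dot, iterations=5000):
    """Approximately maximize L = sum_i a_i K_ii - sum_ij a_i a_j K_ij
    subject to a_i >= 0 and sum_i a_i = 1 (Frank-Wolfe, exact line search)."""
    n = len(xs)
    K = [[kernel(xi, xj) for xj in xs] for xi in xs]
    alpha = [1.0 / n] * n
    for _ in range(iterations):
        Ka = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        aKa = sum(alpha[i] * Ka[i] for i in range(n))
        # Squared distance of every object to the current center,
        # i.e. the left-hand side of equation (4) with z = x_i.
        d2 = [K[i][i] - 2.0 * Ka[i] + aKa for i in range(n)]
        j = max(range(n), key=lambda i: d2[i])     # farthest object
        if d2[j] <= 0.0:
            break
        mean_d2 = sum(alpha[i] * d2[i] for i in range(n))
        step = (d2[j] - mean_d2) / (2.0 * d2[j])   # exact line search
        alpha = [(1.0 - step) * a for a in alpha]
        alpha[j] += step
    return alpha


xs = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.5), (1.0, -0.3)]
alpha = fit_svdd(xs)
center = [sum(a * x[d] for a, x in zip(alpha, xs)) for d in range(2)]
radius = max(math.dist(x, center) for x in xs)
```

For these four points the weight concentrates on the two outermost objects, so the center approaches (1, 0) and the radius approaches 1; the two inner objects end up with multipliers close to zero, in line with the remark that only support objects determine the description.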

In general this does not give a very tight description. Analogous to the
method of Vapnik [?], we can replace the inner products $(x \cdot y)$ in equations
(3) and (4) by kernel functions $K(x, y)$, which gives a much more flexible
method. When we replace the inner products by Gaussian kernels, for instance,
we obtain:

$$(x \cdot y) \rightarrow K(x, y) = \exp(-(x - y)^2 / s^2) \qquad (5)$$

Equation (3) now changes into:

$$L = 1 - \sum_i \alpha_i^2 - \sum_{i \ne j} \alpha_i \alpha_j K(x_i, x_j) \qquad (6)$$

and the formula to check if a new object $z$ is within the sphere (equation
(4)) becomes:

$$1 - 2 \sum_i \alpha_i K(z, x_i) + \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \le R^2 \qquad (7)$$


We obtain a more flexible description than the rigid sphere description. In
figure 1 both methods are shown applied to the same two-dimensional data set.
The sphere description on the left includes all objects, but is by no means
tight: it includes large areas of the feature space where no target patterns are
present. In the right figure the data description using Gaussian kernels is shown,
and it clearly gives a superior description. No large empty areas are included,
which minimizes the chance of accepting outlier patterns. To obtain this tighter
description, one training object in the upper right corner is rejected from the
description.
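
The effect of the kernel width on the acceptance test can be illustrated with a small Python sketch. The multipliers are taken as given rather than optimized: for two objects the symmetric choice of 0.5 each is used. The squared radius is taken as the kernel distance from a support object to the center, and all function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, s):
    """K(x, y) = exp(-(x - y)^2 / s^2), as in equation (5)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / s ** 2)


def kernel_distance2(z, xs, alpha, s):
    """Squared kernel-space distance from z to the center of the description."""
    cross = sum(a * gaussian_kernel(z, x, s) for a, x in zip(alpha, xs))
    const = sum(ai * aj * gaussian_kernel(xi, xj, s)
                for ai, xi in zip(alpha, xs)
                for aj, xj in zip(alpha, xs))
    return 1.0 - 2.0 * cross + const               # K(z, z) = 1 for this kernel


def accepts(z, xs, alpha, s):
    """Accept z when it is no farther from the center than a support object."""
    support = [x for a, x in zip(alpha, xs) if a > 0]
    r2 = kernel_distance2(support[0], xs, alpha, s)
    return kernel_distance2(z, xs, alpha, s) <= r2


xs = [(0.0, 0.0), (2.0, 0.0)]                      # two support objects
alpha = [0.5, 0.5]                                 # assumed multipliers
midpoint = (1.0, 0.0)
wide = accepts(midpoint, xs, alpha, s=4.0)
narrow = accepts(midpoint, xs, alpha, s=1.0)
```

With a wide kernel (s = 4) the point midway between the two objects is accepted, much as it would be by the rigid sphere. With a narrow kernel (s = 1) the same point is rejected, because the description contracts around the individual training objects.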

This Gaussian kernel contains one extra free parameter, the width parameter
s in the kernel (equation (5)). As shown in [5], this parameter can be determined
by setting a priori the maximal allowed rejection rate of the target set, i.e. the
error on the target set. Applying leave-one-out estimation on the training set
shows that non-support objects will be accepted by the SVDD when they are
left out of the training set, while support objects will be rejected. Therefore the
error on the target set can be estimated by the fraction of training objects that
become support objects in the data description:

$$E[P(\text{error})] = \frac{\#SV}{N} \qquad (8)$$

where $\#SV$ is the number of support vectors.

In [2] it is shown that equation (8) is a good estimate of the error on the
target class. The fact that the fraction of support objects can immediately be used
as the estimate of the error on the target class makes this data description
method a very efficient one with respect to the number of objects needed. Because
independent test data is not necessary, all available data can immediately be used
for estimating the SVDD.

Note that we cannot set a priori restrictions on the error on the outlier class. In
general we only have a good representation of the target class, and the outlier
class is by definition everything else.

3 Machine Vibration Analysis

Vibration was measured on a small pump in an experimental setting and on two
identical pump sets in pumping station “Buma” at Lemmer. One of the pumps
in the pumping station showed severe gear damage (pitting, i.e. surface cracking
due to unequal load and wear) whereas the other pump showed no significant
damage. Both pumps of the pumping station have similar power consumption,
age and amount of running hours.

The Delft test rig comprises a small submersible pump, which can be made
to run at several speeds (from 46 to 54 Hz) and several loads (by closing a
membrane controlling the water flow). A number of faults were induced in this
pump: loose foundation, imbalance, and failure in the outer race of the uppermost
ball bearing. Both normal and faulty behaviour were measured at several speeds
and loads.

In both set-ups accelerometers were used to measure the vibration near
different structural elements of the machine (shaft, gears, bearings). Features from
several channels were collected as separate samples in the same feature space,
i.e. inclusion of several channels increases the sample size, not the feature
dimensionality. By putting the measurements of the different sensors into one
data set, the data set increases in size, but information on the exact position of
an individual measurement is lost.

For the Lemmer measurements three feature sets were constructed by joining 
different sensor measurements into one set: 

1. one radial channel near the place of heavy pitting (expected to be a good 
feature), 

2. two radial channels near both heavy and moderate pitting along with an 
(unbalance sensitive) axial channel, and 

3. inclusion of all channels (except for the sensor near the outgoing shaft which
might be too sensitive to non-fault-related vibration).

As a reference dataset, we constructed a high-resolution logarithmic power 
spectrum estimation (512 bins), normalized w.r.t. mean and standard devia- 
tion and its linear projection (using Principal Components Analysis) on a 10- 
dimensional subspace. Three channels were included, expected to be roughly 
comparable to the second configuration previously described. 

For the Delft dataset the same procedure was followed: the first set con- 
tains data from one channel near a fault location, the second set contains three 
channels near fault bearings and the third set contains all five channels. 

4 Features for Machine Diagnostics 

We compared several methods for feature extraction from vibration data. It 
is well known that faults in rotating machines will be visible in the accelera- 
tion spectrum as increased harmonics of running speed or presence of sidebands 
around characteristic (structure-related) frequencies. Due to overlap in series of 
harmonic components and noise, high spectral resolution may be required for 
adequate fault identification. This may lead to difficulties because of the curse 
of dimensionality: one needs large sample sizes in high-dimensional spaces in 
order to avoid overfitting the training set. Hence we focused on relatively low
feature dimensionality (64) and compared the following features: 

power spectrum: standard power spectrum estimation, using Welch’s averaged
periodogram method. Data is normalized to the mean prior to spectrum
estimation, and feature vectors (consisting of spectral amplitudes) are
normalized w.r.t. mean and standard deviation (in order to retain only
sensitivity to the spectrum shape).

envelope spectrum: a measurement time series was demodulated using the
Hilbert transform, and from this cleaned signal (supposedly containing
information on periodic impulsive behavior) a spectrum was determined using
the above method. Prior to demodulation, bandpass filtering in the interval
125-250 Hz (using a wavelet decomposition with Daubechies wavelets of
order 4) was performed: gear mesh frequencies will be present in this band
and impulses due to pitting are expected to be present as sidebands. For
comparison, this pre-filtering step was left out in another data set.

autoregressive modelling: another way to use second-order correlation
information as a feature is to model the time series with an autoregressive model
(AR-model). For comparison with other features, an AR(64)-model was used
(which seemed sufficient to extract all information) and model coefficients
were used as features.

MUSIC spectrum estimation: if a time series can be modeled as
sinusoids plus noise, we can use a MUSIC frequency estimator to focus on the
important spectral components [?]. A statistic can be computed that tends
to infinity when a signal vector belongs to the so-called signal subspace.
When one expects amplitudes at a finite number of discrete frequencies to
be a discriminant indicator, MUSIC features may enable good separability
while keeping feature size (relatively) small.

some classical indicators: three typical indicators for machine wear are

— rms-value of the power spectrum

— kurtosis of the signal distribution

— crest-factor of the vibration signal

The first feature is just the average amount of energy in the vibration signal
(square root of mean of squared amplitudes). Kurtosis is the 4th central
moment of a distribution, which measures the ‘peakedness’ of a distribution.
Gaussian distributions will have kurtosis near 0 whereas distributions with
heavy tails (e.g. in the presence of impulses in the time signal) will show
larger values. The crest-factor of a vibration signal is defined as the peak
amplitude value divided by the root-mean-square amplitude value (both from
the envelope-detected time signal). This feature will be sensitive to sudden
defect bursts, while the mean (or: rms-) value of the signal has not changed
significantly.
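
The three classical indicators are simple to compute; a minimal Python sketch follows. Two points are assumptions: the kurtosis is normalized by the squared variance and shifted by 3, so that a Gaussian signal gives a value near 0 as stated above, and all three functions operate directly on the samples passed in, whereas the text computes the rms-value from the power spectrum and the crest-factor from the envelope-detected signal. The function names are hypothetical.

```python
import math


def rms(signal):
    """Square root of the mean of the squared amplitudes."""
    return math.sqrt(sum(v * v for v in signal) / len(signal))


def kurtosis(signal):
    """Fourth central moment over the squared variance, minus 3."""
    n = len(signal)
    mean = sum(signal) / n
    m2 = sum((v - mean) ** 2 for v in signal) / n
    m4 = sum((v - mean) ** 4 for v in signal) / n
    return m4 / m2 ** 2 - 3.0


def crest_factor(signal):
    """Peak amplitude divided by the root-mean-square amplitude."""
    return max(abs(v) for v in signal) / rms(signal)


smooth = [1.0, -1.0] * 50                          # steady vibration
impulsive = [0.1, -0.1] * 50
impulsive[10] = 5.0                                # one sudden defect burst
```

A single burst in an otherwise quiet signal raises both the kurtosis and the crest-factor well above their values for the steady signal, which is why these indicators react to impulsive faults such as pitting.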

5 Experiments 

To compare the different feature sets, the SVDD is applied to all target data sets.
Because test objects from the outlier class are also available (i.e. the fault class
defined by the pump exhibiting pitting, see section 3), the rejection performance
on the outlier set can also be measured.

In all experiments we have used the SVDD with a Gaussian kernel. For each
of the feature sets we have optimized the width parameter s in the SVDD such
that 1%, 5%, 10%, 25% and 50% of the target objects will be rejected, so for each
data set and each target error another width parameter s is obtained. For each
feature set this gives an acceptance-rejection curve for the target and the outlier
class.

We will start with considering the Lemmer data set with the third sensor 
combination (see section 3), which contains all sensor measurements. In this case
we do not use prior knowledge about where the sensors are placed and which 
sensor might contain most useful information. 





Fig. 2. Acceptance/rejection performance of the SVDD on the features on the
Lemmer data set, with all sensor measurements collected.



In figure 2 the characteristic of the SVDD on this data is shown. If we look at
the results for the power spectrum using 512 bins (see figure 2, left) we see that
for all target acceptance levels we can reject 100% of the outlier class. This is the
ideal behavior we are looking for in a data description method and it shows that
in principle the target class can be distinguished from the outlier class very well.
The drawback of this representation is that each object contains 512 power
spectrum bins: a Fourier spectrum this large is expensive both to calculate and
to store. That is why we try other, smaller representations.

Reducing this 512 bin spectrum to just 10 features by applying a Principal 
Component Analysis (PCA) and retaining the ten directions with the largest 
variations, we see that we can still perfectly reject the outlier class.

Looking at the results of the classical method and the classical method using
bandpass filtering, we see that the target class and the outlier class overlap
significantly. When we try to accept 95% of the target class, only 10% or less
of the outlier class is rejected by the SVDD. Considerable overlap between the
target and the outlier class is also present when envelope spectra are used.
When 5-10% of the target class is rejected, still about 50% of the outlier class
is accepted. Here, too, bandpass filtering does not improve the performance
very much; it is useful only for large target acceptance rates.

Finally, in figure 2 (right) the results for the MUSIC estimator and the
AR-model features are shown. The results on the AR-model feature set and the
MUSIC frequency estimation feature are superior to all other methods, with
the AR model somewhat better than the MUSIC estimator. The AR model almost
matches the 512 bin power spectrum; only for very large acceptance rates of
the target class do we see some patterns from the outlier class being accepted.
Applying the SVDD on the first three principal components deteriorates
the performance of the MUSIC estimator. The AR model still performs almost
optimally.





Fig. 3. Acceptance/rejection performance of the SVDD on the features on the
Delft data set, with all sensor measurements collected.



In figure 3 similar plots are shown for the Delft measurements. Looking
at the performance of the 512 bin power spectrum, we see that here already
considerable overlap between the target and the outlier class exists. This indicates
that this problem is more difficult than the Lemmer data set. The performances
of the different features do not vary much: the MUSIC estimator, AR-model and
the envelope spectrum perform about equally. In all cases there is considerable
overlap between target and outlier class. This might indicate that for one (or 
more) of the outlier classes the characteristics are almost equal to the target 
class characteristics (and thus it is hard to speak of an outlier class). 

The analysis was done on a data set in which all sensor information was
used. Next we look at the performance of the first and the second combination
of sensors in the Lemmer data set. In figure 4 the performance of the SVDD is
shown on all feature sets applied on sensor combination (1) (on the left) and
combination (2) (on the right). Here also the classical features perform poorly.
The envelope spectrum works reasonably well, but both the MUSIC frequency
estimator and the AR-model features perform perfectly. The data from sensor
combination (1) is clearly better clustered than sensor combination (3). Only
the AR model features and the envelope detection with bandpass filtering on
the single sensor data set show reasonable performance.

Fig. 4. Acceptance/rejection performance of the SVDD on the different features
for sensor combination (1) and (2) in the Lemmer data set. 




Fig. 5. Acceptance/rejection performance of the SVDD on the different features 
for sensor combination (1) and (2) in the Delft data set. 



We can observe the same trend in figure 5, where the performances are plotted
for sensor combination (1) and (2) in the Delft data set. Here also the MUSIC 
estimator and the AR model outperform the other types of features, but there are 
large errors, which can be expected considering the complexity of this problem. 

6 Conclusion 

In this paper we tried to find the best representation of a data set such that 
the target class can best be distinguished from the outlier class. This is done by 
applying the Support Vector Data Description, a method which finds the smallest 
sphere containing all target data. We applied the SVDD in a machine diagnostics 
problem, where the normal working situation of a pump in a pumping station 
(Lemmer data) and a pump in an experimental setting (Delft data) should be 
distinguished from abnormal behavior. 

In this application data was recorded from several vibration sensors on a
rotating machine. Three different subsets of the sensor channels were put together
to create new data sets and several features were calculated from the time
signals. Although the three sensor combinations show somewhat different results,
a clear trend is visible.

As a reference a very high resolution power spectrum (512 bins) is used. 
In the case of the Lemmer measurements with this spectrum it is possible to 
perfectly distinguish between normal and abnormal situations, which means that 
in principle perfect classification is possible. In the case of Delft measurements, 
the distinction between target and outlier class becomes more difficult. This is 
probably caused by the fact that the Delft measurements contain more outlier 
situations which are very hard to distinguish from the normal class. 

Performance of both MUSIC- and AR- features was usually very good in all 
three configuration data sets, but worst in the second configuration and best in 
the third configuration. This can be understood as follows: the sensors under- 
lying configuration 2 are a subset of the sensors in configuration 3. Since the 
performance curves are based on percentages accepted and rejected, this per- 
formance may be enhanced by adding new points to a dataset (e.g. in going 
from configuration 2 to 3) that would be correctly classified according to the 
existing description. The sensor underlying the first configuration was close to 
the main source of vibration (the gear with heavy pitting), which explains the 
good performance on that dataset. From the results it is clear that there is quite 
some variation in discrimination power of different channels, but also that in 
this specific application inclusion of all available channels as separate samples 
can be used to enhance robustness of the method. 

In the three dimensional classical feature set both classes severely overlap 
and can hardly be distinguished. This can be caused by the fact that one of the 
classical features is kurtosis, whose estimate shows large variance. Increasing the 
time signal over which the kurtosis is estimated might improve performance, but 
this would require long measurement runs. 

When all sensor combinations and both Lemmer and Delft data sets are 
considered, the AR-model allows for the tightest description of the normal class 
when compared with all other features. We can conclude that if we want to use 
shorter representations of vibration data to overcome the curse of dimensionality, 
the AR-model is the best choice.

Abstract. One of the most challenging problems in econometrics is
the prediction of turning points in financial time series. We compare
ARMA- and Vector-Autoregressive (VAR-) models by examining their
abilities to predict turning points in monthly time series. An approach
proposed by Wecker and enhanced by Kling [2] forms the basis to explicitly
incorporate uncertainty in the forecasts by producing probabilistic
statements for turning points. To allow for possible structural change
within the time period under investigation, we conduct Data Mining by
using rolling regressions over a fixed-size window. For each datapoint a
multitude of models is estimated. The models are evaluated by an economic
performance criterion, the Sharpe-Ratio, and a testing procedure
for its statistical significance developed by Jobson/Korkie [5]. We find
that ARMA-models seem to be valuable forecasting tools for predicting
turning points, whereas the performance of the VAR-models is disappointing.



1 Introduction 

Facing the task of forecasting with a quantitative model, an economist usually
estimates a single model to produce point forecasts. Thereby the uncertainty
inherent in any kind of forecast is neglected: "The generation of a forecast is of no
great practical value if some measure of the uncertainty of that forecast cannot
also be provided." In this paper, our intention is to explicitly incorporate this
uncertainty into probabilistic statements for turning points in monthly financial
time series. We implement a Monte-Carlo-based regression introduced by
Wecker and enhanced by Kling [2], which is described in section 2. To decide
which models (ARMA or VAR) perform better we take the view of a participant
in the financial markets. Here one is not interested in optimizing statistical
criteria, like Mean Squared Error etc., but in an acceptable profit for the risk
taken. The Sharpe-Ratio is a performance criterion which relates the
profits to the risk of an investment. In section 3 we briefly review its basics
and a test for the statistical significance of the difference between the
Sharpe-Ratios of two investments.
Section 4 describes our Data Mining approach to out-of-sample model selection


D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 427-^^^ 1999. 
(c) Springer- Verlag Berlin Heidelberg 1999 
with the rolling regressions and the research design to compare the ARMA- and
VAR-models. Section 5 presents empirical results and concludes.

2 Probabilistic Statements for Turning Points in Time 
Series 



As a first step to obtain a probabilistic statement about a near-by turning point
one has to define a rule when a turning point in the time series is detected. The
turning point indicator

$$ z_t^P = \begin{cases} 1, & \text{if a peak occurs at time } t \\ 0, & \text{otherwise} \end{cases} \qquad (1) $$

is defined as a local extreme value of a certain amount of preceding and
succeeding datapoints:

$$ z_t^P = \begin{cases} 1, & \text{if } x_t > x_{t+i},\ i = -\tau, -\tau+1, \ldots, -1, 1, \ldots, \tau-1, \tau \\ 0, & \text{otherwise} \end{cases} \qquad (2) $$
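Equation (2) translates almost directly into code. The following minimal sketch assumes the series is held in a Python list `x` and that `t` is a 0-based index; the function names are hypothetical, and the trough version simply mirrors the inequality.

```python
def is_peak(x, t, tau=2):
    """Equation (2): x[t] exceeds every value within tau steps on both sides."""
    if t - tau < 0 or t + tau >= len(x):
        raise ValueError("need tau datapoints on each side of t")
    return all(x[t] > x[t + i] for i in range(-tau, tau + 1) if i != 0)


def is_trough(x, t, tau=2):
    """Analogous trough rule: x[t] is below every value within tau steps."""
    if t - tau < 0 or t + tau >= len(x):
        raise ValueError("need tau datapoints on each side of t")
    return all(x[t] < x[t + i] for i in range(-tau, tau + 1) if i != 0)


series = [1, 2, 5, 3, 2, 4, 6]
print(is_peak(series, 2), is_trough(series, 4), is_peak(series, 3))
```

Note that the indicator for time t needs tau values after t, which is why the future values have to be forecast before a decision at time t can be made.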



The trough indicator $z_t^T$ is defined in an analogous way.^1 As we investigate
monthly time series, we define $\tau = 2$. Choosing $\tau = 1$ would result in a model
too sensitive to smaller movements of the time series, whereas with $\tau > 2$ the
model would react with unacceptable delay. At time $t$ the economist knows only
the current and past datapoints $x_t, x_{t-1}, \ldots, x_{t-\tau+1}, x_{t-\tau}$. The future values
$x_{t+1}, \ldots, x_{t+\tau-1}, x_{t+\tau}$ have to be estimated. Since the turning point indicators
$z_t^P$, respectively $z_t^T$, are functions of the future datapoints
$X_{t+1}, X_{t+2}, \ldots$, they are random variables. Using econometric models and the
known $x_t, x_{t-1}, \ldots, x_{t-\tau+1}, x_{t-\tau}$, one can estimate $x_{t+1}, \ldots, x_{t+\tau-1}, x_{t+\tau}$. Here
it becomes clear that the ability to detect turning points critically
depends on the forecasting model. The time series $(X_t)$ could be generated by
a univariate autoregressive process. In this case the following model is adequate
to describe the true data generating process (DGP):^2



$$ x_{t+1} = \beta_0 + \beta_1 x_t + \beta_2 x_{t-1} + \ldots + \beta_R x_{t-R+1} + \epsilon_{t+1} \qquad (3) $$



where $\beta_i$ are the regression coefficients, $R$ is the order of the AR process, and
$\epsilon_{t+1}$ is a white noise disturbance term. Using optimisation techniques, such as
Ordinary Least Squares, a model can be estimated from the data so that
$E[\hat{\beta}_i] = \beta_i$. The model reflects the supposed DGP. The standard deviations
$\sigma_{\hat{\beta}_i}$ of the estimated regression coefficients and the standard deviation
$\sigma_\epsilon$ of the disturbance term are measures of the uncertainty of the forecast by the
model and can be used to judge the ability of the model to mimic the true DGP.
High $\sigma_{\hat{\beta}_i}$



^1 In what follows we do not explicitly distinguish between peaks and troughs.

^2 The simple AR-process only serves for illustration purposes. More complex processes
(e.g. VAR-, non-linear processes) could be relevant as well. The time series $(X_t)$ here
is meant to describe a scalar. Generalisations to vector notation are straightforward.
resp. $\sigma_\epsilon$ correspond to high uncertainty, whereas low values of
$\sigma_{\hat{\beta}_i}$ resp. $\sigma_\epsilon$ mean low uncertainty. After estimating
$x_{t+1}, \ldots, x_{t+\tau}$ the indicators $z_t^P$ and $z_t^T$
can be computed. We are interested in a statement for the next turning point,
so we define

$$ W_t^P = k \qquad (4) $$

where $k$ is such that $z_{t+k}^P = 1$ and $z_{t+j}^P = 0$, $j < k$. Verbally interpreted, $W_t^P$
expresses the number of periods until the next turning point. The following 6-step
Monte-Carlo procedure can be used to derive probabilistic statements for
near-by turning points:

1. Draw random numbers $\beta_0(1), \beta_1(1), \ldots, \beta_R(1)$ from a multivariate normal
   distribution with mean vector^3 $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_R)^T$ and empirical
   variance-covariance matrix of the regression coefficients.

2. Draw a random number $\epsilon_{t+1}(1)$ from a univariate normal distribution with
   mean 0 and variance $\sigma_\epsilon^2$.^4

3. Compute $x_{t+1}(1) = \beta_0(1) + \beta_1(1) x_t + \ldots + \beta_R(1) x_{t-R+1} + \epsilon_{t+1}(1)$.
   If $\tau > 1$, draw $\epsilon_{t+2}(1), \ldots, \epsilon_{t+\tau}(1)$ and iterate step 3 to obtain
   $x_{t+2}(1), \ldots, x_{t+\tau}(1)$.

4. Compute the indicators $z_t^P(1), \ldots$ and $z_t^T(1), \ldots$

5. Compute $W_t^P(1)$ and $W_t^T(1)$.

6. Repeat steps 1 to 5 $N$ times.
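The six steps above can be sketched in plain Python. This is an illustration under assumptions, not the authors' implementation: it handles peaks only, simulates `horizon + tau` future values so that indicators for t, ..., t + horizon can be evaluated, and draws the coefficient vector with a hand-written Cholesky factor. The function names, the `horizon` parameter, and the example numbers are all hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L^T = a (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def draw_coefficients(beta_hat, cov_chol, rng):
    """Step 1: one draw from N(beta_hat, cov) via the Cholesky factor."""
    g = [rng.gauss(0.0, 1.0) for _ in beta_hat]
    return [b + sum(cov_chol[i][k] * g[k] for k in range(i + 1))
            for i, b in enumerate(beta_hat)]


def periods_to_next_peak(history, beta_hat, cov_chol, sigma_eps, tau, horizon, rng):
    """Steps 1-5 for one drawing; returns W, or None if no peak is found."""
    beta = draw_coefficients(beta_hat, cov_chol, rng)
    path = list(history)
    t = len(path) - 1                        # index of the current period
    for _ in range(horizon + tau):           # steps 2-3: simulate the future
        lags = path[::-1][:len(beta) - 1]    # x_t, x_{t-1}, ...
        nxt = beta[0] + sum(b * v for b, v in zip(beta[1:], lags))
        path.append(nxt + rng.gauss(0.0, sigma_eps))
    for k in range(horizon + 1):             # steps 4-5: first k with a peak
        c = t + k
        if all(path[c] > path[c + i] for i in range(-tau, tau + 1) if i != 0):
            return k
    return None


def peak_distribution(history, beta_hat, cov, sigma_eps, tau=2, horizon=3,
                      n_draws=200, seed=0):
    """Step 6: repeat the drawing and return the empirical distribution of W."""
    rng = random.Random(seed)
    chol = cholesky(cov)
    counts = {}
    for _ in range(n_draws):
        w = periods_to_next_peak(history, beta_hat, chol, sigma_eps,
                                 tau, horizon, rng)
        counts[w] = counts.get(w, 0) + 1
    return {w: c / n_draws for w, c in counts.items()}


# Assumed AR(1) estimates: intercept 0.0, slope 0.5, small coefficient variances.
history = [1.0, 1.2, 1.1, 1.3, 1.6]
print(peak_distribution(history, [0.0, 0.5], [[0.01, 0.0], [0.0, 0.01]], 0.1))
```

The history must contain at least max(R, tau) + 1 datapoints so that both the lags and the left-hand side of the peak rule are available.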

The predictive distributions $P_t^P$ and $P_t^T$ for a near-by turning point can be
approximated by the empirical distributions $W_t^P(1), \ldots, W_t^P(N)$ and
$W_t^T(1), \ldots, W_t^T(N)$. As an example take the following table derived from a
Monte-Carlo-Simulation with $N = 10$ drawings:

Table 1. Example for determination of the turning point probabilities

n           1   2   3   4   5   6   7   8   9   10
W_t^P(n)    3   0   2   1   2   2   1   2   2   1

There are three entries for a turning point in $t + 1$: $W_t^P(4) = 1$, $W_t^P(7) = 1$,
$W_t^P(10) = 1$. It follows that $P_t^P(W_t^P = 1) = \frac{3}{10} = 0.3$. The probabilities
for one period are characterised by $P_t^P + P_t^T \le 1$. A turning point is detected
if $P_t^P$ reaches or exceeds a certain threshold $\theta$, e.g. $\theta = 0.5$.
Summarizing section 2, we explicitly incorporate uncertainty of the forecasts by
producing probabilistic statements for near-by turning points. Furthermore, by
not only considering one single model, but a family of N models, our forecasts
are more reliable than those of single models. To compare the ARMA and VAR
models, we need a measure of performance on which to base our decision. It is
discussed in the next section.
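The probability in the example is just a relative frequency over the drawings. A minimal check, using the ten values of Table 1 (the variable names are hypothetical):

```python
from collections import Counter

# W_t(n) for n = 1..10, as listed in Table 1.
w_draws = [3, 0, 2, 1, 2, 2, 1, 2, 2, 1]

counts = Counter(w_draws)
probabilities = {k: counts[k] / len(w_draws) for k in sorted(counts)}
print(probabilities)   # {0: 0.1, 1: 0.3, 2: 0.5, 3: 0.1}
```

The entry for k = 1 reproduces the 0.3 derived in the text.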

^3 The exponent T symbolizes transposition of the vector.

^4 Kling [2], p. 212, estimates the turning points with a VAR model. He draws the
residual vector from a multivariate normal distribution with mean 0 and empirical
variance-covariance matrix of the residuals.

3 Evaluation of Performance

The econometric model classes (ARMA and VAR) applied in this paper have
to be evaluated concerning their task to forecast turning points. Hence it does
not make sense in this context to rely on error measures of function approximation,
such as MSE, MAE, etc. A participant in the financial markets usually
is not interested in function approximation but in economic performance.
Unfortunately error measures of function approximation show little coherence
with trading profits. Criteria especially designed to evaluate a model's ability
to forecast turning points were developed, amongst others, by Brier and by
Diebold/Rudebusch [7]. But those performance measures are similar to the error
measures and a statistical test of significance is not available. One of our
main goals in this study is evaluation based on economic criteria such as profits
from a trading strategy. Since our models do not produce return forecasts but
probabilities for turning points, we have to measure performance indirectly by
generating trading signals from those probabilities: A short position is taken
when a peak is detected ($P_t^P > \theta$, implying that the market will fall, trading
signal $s = -1$), a long position in the case of a trough ($P_t^T > \theta$, market will rise,
$s = +1$), and the position of the previous period is maintained if there is no turning
point. One possibility to evaluate the quality of the trading signal forecasts is
to count the number of correct forecasts and relate it to the number of incorrect
ones. Unfortunately, this type of performance measurement does not discriminate
between trading signals attributed to large and small movements of the
time series. A participant in the financial markets usually is interested in catching
the large movements rather than the small ones. Correctly predicting the large
movements corresponds to the idea of maximising a profit-oriented criterion.
The Sharpe-Ratio, which is briefly described in the following, is such a criterion.
With the actual period-to-period return $r_{actual,t}$ we can calculate the return
from a turning point forecast of our model:

$$ r_{m,t} = s \cdot r_{actual,t} \qquad (5) $$
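The mapping from turning point probabilities to trading signals and then to model returns via equation (5) might look as follows. This is a sketch under assumptions: the threshold is applied with "reaches or exceeds", the starting position `initial` is a hypothetical parameter, and the alignment of signal and return periods is simplified to element-wise pairing.

```python
def trading_signals(p_peak, p_trough, theta=0.5, initial=0):
    """Map turning point probabilities to positions: -1 short, +1 long."""
    signals, current = [], initial
    for pp, pt in zip(p_peak, p_trough):
        if pp >= theta:
            current = -1         # peak detected: market expected to fall
        elif pt >= theta:
            current = +1         # trough detected: market expected to rise
        signals.append(current)  # otherwise the previous position is kept
    return signals


def model_returns(signals, actual_returns):
    """Equation (5): r_{m,t} = s * r_{actual,t}."""
    return [s * r for s, r in zip(signals, actual_returns)]


signals = trading_signals([0.6, 0.1, 0.2], [0.1, 0.2, 0.7])
print(signals)
print(model_returns(signals, [-0.02, 0.01, 0.03]))
```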

Subtracting the risk-free rate of interest $r_{f,T}$ from the average return of the model
$\bar{r}_m = \frac{1}{T} \sum_{t=1}^{T} r_{m,t}$ over $T$ periods yields the average excess return
$e_m = \bar{r}_m - r_{f,T}$. The Sharpe-Ratio $SR$ relates $e_m$ to the return's standard
deviation $\sigma_m$:

$$ SR_m = \frac{e_m}{\sigma_m} \qquad (6) $$

The Sharpe-Ratio measures the excess return a model produces for a unit of risk
the model takes. Jobson/Korkie [5] developed a test for the null hypothesis

$$ H_0 : SR_m - SR_{BM} = 0 \qquad (7) $$
of no significant difference between the Sharpe-Ratios of a forecasting model and
a benchmark.^5 By re-arranging (7) and relating it to the variance $\delta$ of the two
Sharpe-Ratios,^6 Jobson/Korkie find that the test statistic

$$ z_{m,BM} = \frac{e_m \cdot \sigma_{BM} - e_{BM} \cdot \sigma_m}{\sqrt{\delta}} \qquad (8) $$

asymptotically follows a standard normal distribution and is powerful in moderately
large samples. A positive and statistically significant $z_{m,BM}$ means that
the model outperforms the benchmark in terms of the Sharpe-Ratio. The next
section shows how the Jobson/Korkie-test was used within our Data Mining
approach.
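A compact sketch of the Sharpe-Ratio and of a Jobson/Korkie-type z-statistic is given below. The numerator follows equation (8). For the variance in the denominator this example assumes the commonly used form with Memmel's correction, which may differ from the expression used in the paper; the function names and the return series are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sharpe_ratio(excess_returns):
    """Equation (6): mean excess return divided by its standard deviation."""
    m = mean(excess_returns)
    var = sum((r - m) ** 2 for r in excess_returns) / (len(excess_returns) - 1)
    return m / math.sqrt(var)


def jobson_korkie_z(ex_m, ex_bm):
    """z-statistic for H0: SR_m - SR_BM = 0 (assumed Memmel-corrected variance)."""
    t = len(ex_m)
    mu_m, mu_b = mean(ex_m), mean(ex_bm)
    var_m = sum((r - mu_m) ** 2 for r in ex_m) / (t - 1)
    var_b = sum((r - mu_b) ** 2 for r in ex_bm) / (t - 1)
    cov = sum((a - mu_m) * (b - mu_b) for a, b in zip(ex_m, ex_bm)) / (t - 1)
    sd_m, sd_b = math.sqrt(var_m), math.sqrt(var_b)
    variance = (2 * var_m * var_b - 2 * sd_m * sd_b * cov
                + 0.5 * mu_m ** 2 * var_b + 0.5 * mu_b ** 2 * var_m
                - mu_m * mu_b / (sd_m * sd_b) * cov ** 2) / t
    return (mu_m * sd_b - mu_b * sd_m) / math.sqrt(variance)


# Assumed monthly excess returns of a model and of a benchmark.
model = [0.02, -0.01, 0.03, 0.01, -0.02, 0.04]
bench = [0.01, 0.00, -0.01, 0.02, -0.03, 0.01]
print(sharpe_ratio(model), sharpe_ratio(bench), jobson_korkie_z(model, bench))
```

Swapping the two series flips the sign of the statistic, so a negative value simply means the benchmark has the higher Sharpe-Ratio.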






4 Data Mining in Financial Time Series 



To test the ability of the ARMA and VAR models to predict turning points, we
investigated the logarithms of nine financial time series, namely DMDOLLAR,
YENDOLLAR, BD10Y (performance index for the 10 year German government
benchmark bond), US10Y, JP10Y, MSWGR (performance index for the German
stock market), MSUSA, MSJPA, and the CRB-Index. The data was available
in monthly periodicity from 83.12 to 97.12, equalling 169 datapoints. One was
lost because of differencing. 100 datapoints were used to estimate the models, so
that we can base our decision which model class performs better on 68 out-of-sample
forecasts. To allow for the possibility of structural change in the data, we
implemented rolling regressions: After estimating the models with the first 100
datapoints and forecasting the succeeding datapoints, the data "window" of the
fixed size of 100 datapoints was moved forward by one period and the estimation
procedure was repeated. We estimated a multitude of models for each model class:
15 ARMA-models from (1,0), (0,1), (1,1), ..., to (3,3) and 3 VAR models VAR(1),
(2), and (3). We do not specify a single model and estimate all rolling regressions
with it. Rather we specify a class of models (ARMA and VAR). Within a
class the best model is selected for forecasting. As an extreme case, a different
model specification could be chosen for every datapoint (within the ARMA class
e.g. the ARMA(1,0) model for the first rolling regression, ARMA(2,2) for the
second etc.). This model selection procedure is purely data-driven, so it can be
regarded as Data Mining. Since it is well known that in-sample evaluation is a
poor approximation for out-of-sample performance, reliable model selection
has to be based on true out-of-sample validation. Therefore we divided the data
into three subsequent, disjoint parts: a training subset (70 datapoints), a validation
subset (30 datapoints), and a forecast subset ($\tau = 2$ datapoints,^7 see Fig. 1).



^5 Jobson/Korkie also developed a test statistic which allows the Sharpe-Ratios
of more than two portfolios to be compared simultaneously.



122 122 

% 



-(c 



^7 $\tau$ symbolizes the number of periods that have to be forecast in order to make a
turning point decision at time t, cf. section 2.

Fig. 1. Division of the database
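The division of each rolling window into training, validation and forecast subsets can be written as an index generator. This is a minimal sketch assuming 0-based, end-exclusive index ranges and a window that advances by one period; the function name and the example length are hypothetical.

```python
def rolling_windows(n_points, train=70, valid=30, tau=2):
    """Yield (training, validation, forecast) index ranges for each window.

    Ranges are 0-based and end-exclusive. A window is produced only while
    the tau forecast periods still lie inside the available data.
    """
    window = train + valid
    start = 0
    while start + window + tau <= n_points:
        yield ((start, start + train),
               (start + train, start + window),
               (start + window, start + window + tau))
        start += 1


for train_rng, valid_rng, fcst_rng in rolling_windows(105):
    print(train_rng, valid_rng, fcst_rng)
```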



The first 70 datapoints from t-99 to t-30 were used to estimate the models,
which were validated with respect to their abilities to predict turning points on
the following 30 datapoints from t-29 to t. This is true out-of-sample validation of
the models, because at time t-30 the datapoints from t-29 to t are unknown. For
each model and each datapoint in each rolling regression, N=200 Monte-Carlo
simulations were performed in order to calculate the turning point probabilities.
If the model at the beginning of the validation period in t-29 decided "no turning
point" (which results in maintenance of the previous period's trading signal), there
is no trading signal originally stemming from the model. In this case we used
the last trading signal which could be produced with certainty. With $\tau = 2$
the last certain signal for a turning point can be generated for t-31. Then the
best model on the validation subset was selected. For each of the two model
sequences (ARMA and VAR) only one model was selected at each time. The
specification of this model, e.g. ARMA(2,2), then was re-estimated with the 100
datapoints from t-99 to t to forecast the $\tau$ values of the time series that are
unknown at time t and necessary to decide whether there is a turning point at
time t. The decision which model is the "best" was made with respect to the
Jobson/Korkie-test on the difference between the Sharpe-Ratio of the models
and a benchmark. Thus each of the multitude of models of each sequence had
to be tested against a benchmark. Since our goal is the comparison of the two
competing classes ARMA vs. VAR models, each sequence has to consist of
representatives of this model class. The simplest model of each class served as a
benchmark in the statistical tests: for the ARMA-sequence the benchmark was
the (1,0)-model, in the other case the VAR(1)-model. Using the Jobson/Korkie-test
as a criterion for selection, each of the multitude of ARMA models was
tested against the (1,0) model. If e.g. the ARMA(2,2) model could reject the
null hypothesis $z_{m,BM} = 0$ with a significantly positive $z_{m,BM}$ on the validation
subset (t-29 to t), this model specification was selected and re-estimated with
t-99 to t to forecast the $\tau$ unknown values of the time series for $t+1, \ldots, t+\tau$. If
more than one model qualified, the one with the highest SR was selected. If the
null could not be rejected, the forecasts were conducted with the (1,0)-model.
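The selection rule at the end of the paragraph can be sketched as a small function. The inputs are assumed to be precomputed on the validation subset; the critical value 1.645 corresponds to a two-sided 10% level and is an assumption of this sketch, as are the function name and the example numbers.

```python
def select_model(candidates, benchmark, critical_z=1.645):
    """Pick the model used for forecasting in one rolling regression.

    candidates maps a model name to (z against the benchmark, Sharpe-Ratio),
    both measured on the validation subset. Models with a significantly
    positive z qualify; among them the highest Sharpe-Ratio wins. If none
    qualifies, the benchmark model is used.
    """
    qualified = {name: sr for name, (z, sr) in candidates.items()
                 if z > critical_z}
    if not qualified:
        return benchmark
    return max(qualified, key=qualified.get)


validation_results = {
    "ARMA(2,2)": (1.98, 0.21),
    "ARMA(1,1)": (1.70, 0.25),
    "ARMA(3,0)": (0.40, 0.40),   # high SR, but not significant
}
print(select_model(validation_results, "ARMA(1,0)"))
```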

Table 2. ARMA- and VAR-sequence as an example for the rolling regressions

                                               ARMA                VAR
RR   training     validation   forecast       Spec.   z_m,BM      Spec.   z_m,BM
1    84.1-89.10   89.11-92.4   92.5-92.6      (1,0)   *           (3)     3.3563
                                                                          (.0008)
2    84.2-89.11   89.12-92.5   92.6-92.7      (2,2)   1.9823      (1)     *
                                                      (.0475)
...
68   89.8-95.4    95.5-97.10   97.11-97.12    (3,0)   2.1486      (3)     1.9987
                                                      (.0317)             (.0456)


This procedure was implemented for the VAR-sequence in an analogous way.
As we solely rely on economic performance to select the best model, we do not
consider statistical criteria, like t-values etc. The variances resp. standard deviations
of the regression coefficients are needed for the drawings of the random
numbers to incorporate the uncertainty in the forecasts. Badly fitted models
with high variability in the coefficients and accordingly high variances will not
be able to detect the relevant turning points in the validation subset and so will be
disqualified in the selection procedure. Two sequences with a threshold $\theta = 0.5$
and a significance level of 0.1 could look like Table 2. The first four columns
refer to the number of the rolling regression and the training, validation, and
forecast period, respectively. The 5th (7th) column gives the specification of the
selected ARMA (VAR) model, the 6th (8th) column gives $z_{m,BM}$ (the entry (*)
in the column "$z_{m,BM}$" means that in this period no model qualifies against the
benchmark; p-values in parentheses below the $z_{m,BM}$-value).

The first turning point forecast was done for 92.4 (with the unknown values
of 92.5 and 92.6), the last for 97.10. The primary objective of this paper is to
make a statement about the relative performance of ARMA- vs. VAR-models
in detecting turning points in time series. We created two model sequences with a
sample size of 68 forecasts each. In order to produce a statistically significant
result, we compare the ARMA sequence with the VAR sequence. Therefore we
compute the Sharpe-Ratios of the ARMA excess returns ($SR_{ARMA}$) and the
VAR excess returns ($SR_{VAR}$) over the 68 out-of-sample rolling regression
forecasts. Comparing those two Sharpe-Ratios with the Jobson/Korkie test statistic
$z_{ARMA,VAR}$ (thereby using $SR_{VAR}$ as the benchmark) allows us to make a
statement about which model class performs better on the 68 datapoints. A result might
be that $z_{ARMA,VAR} > 0$ and statistically significant (e.g. $z_{ARMA,VAR} = 1.96$ is
significant at the 5%-level). In that case the conclusion is that ARMA models
outperform VAR models.^8 This does not mean that ARMA models are valuable
forecasting tools. In order to be sure that ARMA models in this example are
valuable forecasting tools, one would like to test if this model class is able to

^8 In the case $z_{ARMA,VAR} < 0$ VAR-models outperform ARMA-models.

outperform a simple benchmark as well. When forecasting economic time series,
a simple benchmark is the naive forecast. The naive forecast uses the status of
the current period to forecast the next period. In the context of return forecasts,
the return of the current period is extrapolated as a forecast of the next period.
As we deal with turning point predictions and do not produce explicit return but
trading signal forecasts derived from turning point predictions, the extrapolation
of returns is not adequate. One could think of using the actual return from t-1
to t as an indicator for the trading signal of the future period: a past positive
return means a trading signal +1 for the future period, a negative return means
a trading signal -1. One goal of turning point predictions is to detect the longer
term trend reversals. Using the past return as a benchmark is more adequate for
short-term, period-to-period forecasts. Hence we need a naive benchmark which
works in a similar way as our model and thereby reflects the idea of turning
point forecasts.^9 Using the last certain turning point statement can be regarded
as a benchmark in this sense. As $\tau = 2$, the last certain turning point statement
can be made for $t - 2$, using the datapoints from $t - 4$ to $t$. A valuable forecasting
model should be able to outperform this Naive Turning Point Forecast
(NTPF), so it is straightforward to test the ARMA- and VAR-sequences against
the NTPF. E.g. a significantly positive $z_{ARMA,NTPF}$ ($z_{VAR,NTPF}$) calculated
with the Sharpe-Ratios for the ARMA- (VAR-) and NTPF-sequences over the
68 out-of-sample forecasts implies that the ARMA- (VAR-) models are valuable
tools for forecasting turning points in financial time series. If $z_{ARMA,NTPF}$
is negative, the NTPF produces a higher Sharpe-Ratio over the 68 datapoints.
The next section presents empirical results from the turning point forecasts with
our Data Mining approach.
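One way to read the Naive Turning Point Forecast is sketched below: the signal at time t comes from the last turning point statement that can be made with certainty, i.e. for t - tau, using only observed data. The mapping of a confirmed peak to a short position and of a trough to a long position follows the trading rule of section 3; that the NTPF uses exactly this mapping is an assumption of the sketch, and the function name is hypothetical.

```python
def naive_turning_point_signal(x, previous_signal, tau=2):
    """Signal at time t from the last certain turning point statement (t - tau).

    x holds the observed series up to and including time t, so the check
    uses the datapoints from t - 2*tau to t.
    """
    if len(x) < 2 * tau + 1:
        raise ValueError("need at least 2 * tau + 1 observed datapoints")
    c = len(x) - 1 - tau                       # index of period t - tau
    neighbours = [x[c + i] for i in range(-tau, tau + 1) if i != 0]
    if all(x[c] > v for v in neighbours):
        return -1                              # confirmed peak: go short
    if all(x[c] < v for v in neighbours):
        return +1                              # confirmed trough: go long
    return previous_signal                     # no turning point: keep position


print(naive_turning_point_signal([1, 2, 5, 3, 2], previous_signal=+1))
print(naive_turning_point_signal([5, 3, 1, 2, 4], previous_signal=-1))
print(naive_turning_point_signal([1, 2, 3, 4, 5], previous_signal=+1))
```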

5 Empirical Results 

The following Table 3 exhibits the results for the turning point forecasts with a
significance level of SL=.1 for the Jobson/Korkie-test and a threshold of $\theta = .75$.
Results with different threshold values $\theta = .5$ and $\theta = .95$ showed that overall
$\theta = .75$ produces the best results of the model classes vs. NTPF. The lower the
level of significance for the Jobson/Korkie-test, the higher is the required difference
between the two Sharpe-Ratios to be considered as statistically significant.
With SL=.01, the sequences almost solely comprised the simplest (=benchmark)
model (ARMA(1,0) resp. VAR(1)) in each model sequence. To a lesser degree
this was valid for the turning point forecasts with a significance level SL=.1,
whose results are presented in Table 3. Another point in this direction comes
from the sample size of only 30 datapoints in the validation subset, which results
in low discriminative power of the test. Hence the bias for the selection of the
simplest model in each sequence is relatively high. The nine financial time series
are considered as a closed market system, where every variable influences every
other. Therefore the VAR models consisted of nine equations with lags of all
^9 For a detailed discussion of the problem to select an adequate benchmark for turning
point models see Poddig/Huber [11], p. 30ff.

Table 3. Empirical results

                                            ARMA vs VAR     ARMA vs NTPF    VAR vs NTPF
Series       SR_ARMA   SR_VAR   SR_NTPF     z      p(z)     z      p(z)     z      p(z)
MSWGR         .0124    -.1419   -.0038     1.44   .1488     .13   .8897           .3450
MSUSA         .1149    -.0287    .0341     1.10   .2703           .4774           .6265
MSJPA        -.0398     .1300   -.0700    -1.14   .2544           .8955    1.07   .2862
BD10Y        -.2206    -.3545   -.2602     1.39   .1653           .6402           .4352
US10Y        -.1253    -.0851   -.1597     -.36   .7187           .7855           .6178
JP10Y         .0883     .0480   -.0824      .58   .5613           .1764           .2450
DMDOLLAR     -.2569    -.3169   -.3098      .45   .6520           .6977           .9636
YENDOLLAR    -.1966    -.1931    .1015     -.03   .9755           .5992           .5778
CRB          -.3458    -.4307   -.4759      .44   .6597           .2920           .8132


variables in the system and a constant. Table 3 exhibits in the first column the
name of the time series under consideration. The three following columns show
the values of the Sharpe-Ratios of the ARMA- and VAR-models, and the NTPF,
respectively. For each change from a long- into a short-position transaction costs
of 0.75% were subtracted. The 5th and 6th columns give the z-values
for the Jobson/Korkie-test statistic and the corresponding p-values for the test of
the ARMA- vs. the VAR-sequences. The two following columns present z- and
p-values for the comparison of the ARMA- vs. NTPF-sequences. The 9th and
10th columns exhibit those values for the VAR- vs. NTPF-sequences.
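Since the test statistic is asymptotically standard normal, a p-value can be obtained from z with the complementary error function. The tabulated values look two-sided (for instance z = 1.44 is listed with p = .1488), so the sketch below assumes a two-sided test; small deviations from the table are to be expected because the printed z-values are rounded. The function name is hypothetical.

```python
import math


def two_sided_p(z):
    """P(|Z| >= |z|) for a standard normal variable Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


for z in (1.96, 1.44, -1.14):
    print(z, round(two_sided_p(z), 4))
```

For z = 1.96 this gives a value very close to 0.05, matching the 5% level mentioned in section 4.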

Looking at the results for, e.g., MSWGR in detail, only the ARMA-models
were able to produce a positive Sharpe-Ratio ($SR_{ARMA} = .0124$) in the
out-of-sample forecasts for the 68 months under consideration. This means that the
ARMA-models on average reward each unit of risk, measured in standard
deviations of the return, with an excess return of 1.24% per month. From the positive
$z_{ARMA,VAR} = 1.44$ ($p = .1488$) for the null hypothesis
$H_0: SR_{ARMA} - SR_{VAR} = 0$ it can be seen that they outperformed the
VAR-models, although not significantly at the usual levels (1% to 10%).
ARMA-models outperformed the NTPF as well ($z_{ARMA,NTPF} = .13$, $p = .8897$).
VAR-models underperformed even the NTPF, so they do not seem to be a valuable
tool for forecasting turning points in the German stock market. ARMA-models
managed to produce positive Sharpe-Ratios for three out of the nine markets
(MSWGR, MSUSA, JP10Y), VAR-models (MSJPA, JP10Y) and the NTPF (MSUSA,
YENDOLLAR) only twice each. In six markets ARMA-models outperformed
VAR-models (positive $z_{ARMA,VAR}$-values), and in all markets but YENDOLLAR
they outperformed the NTPF (positive $z_{ARMA,NTPF}$-values), but not
significantly. VAR-models managed to beat the NTPF only four times (positive
$z_{VAR,NTPF}$-values for MSJPA, US10Y, JP10Y, CRB). MSJPA is the only market in
which VAR-models remarkably outperform ARMA-models ($z_{ARMA,VAR} = -1.14$,
$p = .2544$). In two more cases VAR-models at least produce higher Sharpe-Ratios
($z_{ARMA,VAR}$ negative for US10Y and YENDOLLAR). Only the Japanese stock
market MSJPA could be predicted remarkably more successfully by VAR- than by
ARMA-models and NTPF (MSJPA: $z_{VAR,NTPF} = 1.07$, $z_{ARMA,VAR} = -1.14$).
For JP10Y the VAR outperformed the NTPF, but so did the ARMA-models. The
CRB-index seems to be unpredictable: none of the three sequences produced a
positive Sharpe-Ratio, but the ARMA- (SR = -.3458) and VAR-models (SR = -.4307)
still performed better than the NTPF (SR = -.4759). Summarizing the results in
brief, it seems that ARMA-models are better tools for forecasting turning points
in financial time series. In all but one case they managed to outperform the
NTPF, although in no single case with statistical significance. The bad
performance of the VAR-models might be due to their possible overparameterisation
with nine variables and one to three lags. The simplest VAR(1)-model comprises
nine variables plus a constant in each equation, which results in
$(9 + 1) \cdot 9 = 90$ regression coefficients.
Future research in the area of turning point forecasts will concentrate on smaller
VAR-models and the derivation of portfolio weights from the turning point
probabilities. This can be accomplished by "rewarding" forecasts with a high
degree of certainty (e.g. $P_t \approx 1$, certainty of a peak, or
$P_t \approx 0$, certainty of no peak) and "penalizing" the ones with a low
degree of certainty ($P_t \approx 0.5$).
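
The parameter count behind the overparameterisation argument can be sketched as follows. The VAR(1) figure of 90 coefficients is the one given in the text; applying the same counting rule to higher lag orders is an extrapolation made here for illustration, and the function name is hypothetical.

```python
def var_coefficient_count(n_variables: int, n_lags: int) -> int:
    """Number of regression coefficients in an unrestricted VAR(p).

    Each equation holds one coefficient per variable and lag plus a constant,
    and there is one equation per variable.
    """
    per_equation = n_variables * n_lags + 1
    return per_equation * n_variables


for lags in (1, 2, 3):
    print(f"VAR({lags}) with 9 variables: {var_coefficient_count(9, lags)} coefficients")
```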
Abstract. This paper describes a Memory-Based Reasoning application
that generates candidate classifications to aid editors in allocating
abstracts of judicial opinions among the 82,000 classes of a legal clas- 
sification scheme. Using a training collection of more than 20 million 
previously classified abstracts, the application provides ranked lists of 
candidate classifications for new abstracts. These lists proved to contain 
highly relevant classes and integrating this application into the edito- 
rial environment should materially improve the efficiency of the work of 
classifying the new abstracts. 



1 Introduction 

There is much research in the fast growing field of automated text
classification in both the Information Retrieval (e.g. [?]) and the Machine
Learning (e.g. [?]) communities. Text classification is all the more important
for information providers and publishers generally. Classification systems have
been applied to collections of news articles [?], medical records [?], or the
Web [?]. Despite the research in this area, reports of applications of automatic
classification techniques in production environments are rare [?].

This paper focuses on an application in the area of legal abstracts. West 
Group’s legal classification system (known as the Key Number system) is used 
to classify more than 350,000 abstracts per year among approximately 82,000 
separate classes. This work is performed manually by a staff of highly specialized 
attorney/editors. This application is designed to provide computer-aided support 
for this time-consuming task. 

The hypotheses upon which this application is based are: 

— a Memory-Based Reasoning approach utilizing the more than 20,000,000 
already existing classified abstracts can produce relevant candidate classi- 
fications for new abstracts and will improve the efficiency of classification 
editors, 

— in a Memory-Based Reasoning system [?] (a k-nearest neighbor method),
the accuracy of the candidate classifications is proportional to the amount
of training data, and

— a small number of neighbors (k) is inappropriate given the nature of our
data.



D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 437-|5^^ 1999. 
(c) Springer- Verlag Berlin Heidelberg 1999 
Section 2 describes the problem. In Sect. 3 we briefly present our
classification approach. Information on our text collections is provided in
Sect. 4 while Sect. 5 presents our evaluation strategy. Results are reported in
Sect. 6 and discussed in Sect. 7.

2 The Headnote Classification Task 

2.1 Key Number System - Background 

The American legal system is based in substantial part upon judicial precedent. 
That is, the pronouncements of judges in their written and published decisions 
declare the law and, theoretically at least, establish the law that will be followed 
in similar cases. Therefore, it is necessary that American lawyers and judges 
have access to the entire body of judicial opinions in order to determine what 
the law is in any given situation. 

West Publishing, a predecessor of West Group, first began publishing ju- 
dicial opinions in 1872, and its National Reporter System (NRS) now contains 
approximately 5 million published opinions from virtually every federal and state 
jurisdiction. Using its proprietary Key Number classification system, West
categorizes the points of law stated in judicial opinions. This system has been the
principal tool for the location of judicial precedent since the turn of the century. 

At the top level, the Key Number system has over 400 topics. Topics are 
generally further subdivided with large subjects having a number of hierarchical 
layers (up to 8 levels deep). At the bottom of the hierarchical tree structure 
are the individual Key Numbers. There are approximately 82,000 of these Key 
Number categories, each one of which delineates a particular legal concept. 

2.2 Key Number Assignment 

Attorney/editors at West read each opinion to be published in the NRS and
make individual abstracts for each point of law enunciated or discussed in the
opinion. These abstracts, or headnotes as they are called, are then given a Key
Number classification by experienced editors (see Fig. 1 for an example). The
classification editors often examine the classification of headnotes in cases
cited in the opinion, and also use Westlaw®¹ Boolean search techniques in
attempting to achieve accuracy and consistency.

About 350,000 headnotes are classified each year. These classified headnotes 
appear with case law documents on Westlaw® and in print digest publications.
To function properly as a reliable tool for finding precedent, these classifications 
must be consistent. With a legacy collection of 20 million classified headnotes, 
maintenance is also a priority. Changes in the law require modification of exist- 
ing topics and creation of new topics. Redefinition of existing concepts by the 
courts requires new classification of older material. Between the classification of 
new headnotes and the reclassification of legacy headnotes, a substantial and 
expensive manual effort is required. 



¹ Westlaw® is West Group’s computer-aided legal research system.
134 DIVORCE
  134V Alimony, Allowances, and Disposition of Property
    134k230 Permanent Alimony
      134k235 k. Discretion of court

Abuse of discretion in award of maintenance occurs only where no reasonable
person would take view adopted by trial court



Fig. 1. Example of a headnote and its associated Key Number hierarchy 



2.3 The Computer-Assisted Task

We are interested in a system that eases the burden of manually classifying 
the headnotes and increases the consistency of classification. Headnotes are con- 
sidered easy or difficult to classify. An easy classification is one for which no 
research is needed by the editor. Approximately 40 percent of headnotes are dif- 
ficult to classify, but an editor spends over 60 percent of his/her time classifying 
these headnotes. Thus, from a production standpoint, the system’s value lies in 
assisting classification of difficult headnotes. 

The system described below attempts to accomplish this task by locating 
legacy headnotes similar to new, unclassified headnotes and suggesting candi- 
date Key Numbers. It displays to editors lists of likely Key Numbers or hier- 
archically defined ranges of Key Numbers. The system also makes available for 
review the similar headnotes. This type of information is currently available only 
by running complicated Boolean searches on Westlaw® and gleaning the Key
Number information from the search results. In effect, the system emulates the 
kind of manual research that is done by editors to discover correct Key Numbers 
for newly created headnotes. 

Ideally, the system will rank the most likely Key Numbers highest. Further- 
more, it should be capable of displaying the list of Key Numbers in the context 
of the Key Number classification hierarchy. Doing so will direct editors to the 
neighborhood of relevant Key Numbers even if none of the Key Numbers
displayed is exactly applicable. If several Key Numbers are found under the same
parent, relevant Key Numbers are likely to appear in the neighborhood even if
the individual Key Numbers themselves are weak evidence. The system lists
the most promising Key Numbers in ranked order. The editors view the list of
suggested Key Numbers and select the relevant Key Numbers.

3 The Classification Algorithm 

We chose Memory-Based Reasoning, a variant of the k-nearest neighbor method,
as our core classification algorithm. That choice was motivated by the following:
first, we wanted to leverage the existing huge collection of manually classified
headnotes (over 20 million); secondly, we needed an inductive method that could
handle classification over 82,000 classes (Key Numbers); finally, we wished to get
preliminary results with a minimum effort in development.

Given a test instance, the k-nearest neighbor classifier retrieves stored
instances that are closest to the test instance with respect to some distance
function; it then outputs the most probable class, given the classes of the
retrieved instances. Our k-nearest neighbor approach relies on a full text
retrieval engine. First, training documents are indexed. Then, a test instance is
transformed into a structured query, and a search is run against the indexed
collection. The result of that search is a ranked list of document-score pairs.
A score corresponds to the similarity between the retrieved document and the
test instance. We use this similarity score as the metric for finding the nearest
neighbors. Finally, we extract the Key Numbers (classes) associated with the k
top documents (i.e. the k nearest neighbors to the test instance), and rank those
Key Numbers according to a scoring function. We rank Key Numbers because the
system is intended to propose candidate classifications for human review.

We used research implementations of the indexing program at West Group
and of the natural language version of Westlaw® (both referred to as NL-Westlaw
in the remainder of this paper). NL-Westlaw is based on the work by H. Turtle
[?] and is related to INQUERY, another implementation of the same theoretical
model [?].



3.1 Collection Indexing 

Indexing involves the following steps: 

— tokenization: tokenization reads in documents, removes stop-words (in our
case, there are 290 stop-words) and single digits², and stems³ terms using the
Porter stemming algorithm [?].

— transaction generation: a transaction is a tuple grouping a term t, a document
identifier n, the frequency of t in n, and the positions of t in n.

— inverted file creation: an inverted file (see [21] for instance) allows efficient
access to term information at search time. Records in the inverted file store
the term, the number of documents in the collection the term appears in,
and the transactions created at the previous stage.
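
The three indexing steps can be illustrated with a minimal sketch. It is not the West Group indexing program: the stop-word list is a tiny hypothetical stand-in for the 290-word list, Porter stemming is omitted, and term positions are counted after stop-word removal, which is an assumption of this sketch.

```python
from collections import defaultdict

# Hypothetical, very small stop-word list (the paper's list has 290 entries).
STOP_WORDS = {"the", "of", "in", "a", "by", "no", "would", "where"}


def tokenize(text):
    """Lower-case, split on whitespace, drop stop-words and single digits."""
    tokens = []
    for raw in text.lower().split():
        term = raw.strip(".,;:()")
        if not term or term in STOP_WORDS:
            continue
        if len(term) == 1 and term.isdigit():
            continue
        tokens.append(term)  # stemming is left out of this sketch
    return tokens


def build_inverted_file(documents):
    """Map each term to its document frequency and its transactions.

    A transaction is the tuple (document id, frequency, positions).
    """
    inverted = defaultdict(lambda: {"df": 0, "transactions": []})
    for doc_id, text in documents.items():
        positions = defaultdict(list)
        for pos, term in enumerate(tokenize(text)):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            record = inverted[term]
            record["df"] += 1
            record["transactions"].append((doc_id, len(pos_list), pos_list))
    return dict(inverted)


docs = {
    1: "Abuse of discretion in award of maintenance",
    2: "Discretion of court in award of alimony",
}
index = build_inverted_file(docs)
print(index["discretion"])
```

At search time, a record gives direct access to both the document frequency of a term and the per-document frequencies needed for scoring.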



3.2 Similarity Score as a Distance between Documents 

Two components are involved in retrieving documents strongly related to the
query: concept resolution and search. Concept resolution turns a natural
language query (here a headnote) into a structured query and identifies concepts.
Concepts may be individual terms, or query operators and their operands. One
example is the phrase operator, where operands are terms. Concept resolution
additionally gathers global information for each concept, such as the number
of documents in the collection containing the concept. This information is then
used during the search phase.

² We removed single digits as they tended to appear as item markers in enumerations.
³ A word “stem” corresponds to the root of a word after removal of predefined affixes.

The search process is responsible for computing the similarity between
documents in the collection and the query, given the identified concepts.
Individual terms are scored using a tf-idf formula:

$$w(t, d) = 0.4 + 0.6 \times tf(t, d) \times idf(t).$$

The inverse document frequency factor (idf) favors terms that are rare in the
collection, while the term frequency factor (tf) gives a higher importance to
terms that are frequent in the document to be scored. In our current setting, we
use:

$$idf(t) = \frac{\log(N) - \log(df(t))}{\log(N)} \quad \text{and} \quad tf(t, d) = 0.5 + 0.5 \times \frac{\log(f(t, d))}{\log(\mathit{maxtf})}$$

where N is the total number of documents in the collection, df(t) is the number
of documents where term t appears, f(t, d) is the number of occurrences of term
t in document d and maxtf is the maximum frequency of a term in document
d.

More generally, concept scores are derived from the score of their operands 
and the scoring rule associated with the operator (see [?]). The similarity score
between a document and the query is then obtained by averaging the scores of 
the concepts. In the end, the search returns a ranked list of documents and their 
associated scores. 
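
A minimal sketch of the term weighting and of the averaging step is given below. It covers only concepts that are individual terms (no phrase or other query operators). Two choices are assumptions of this sketch rather than statements from the paper: a query term absent from the document contributes only the constant 0.4, and the ratio of logarithms is taken to be 1 when the maximum term frequency is 1. All function names are illustrative.

```python
import math


def idf(n_docs, doc_freq):
    """Inverse document frequency: (log N - log df) / log N."""
    return (math.log(n_docs) - math.log(doc_freq)) / math.log(n_docs)


def tf(freq, max_freq):
    """Term frequency factor: 0.5 + 0.5 * log f / log maxtf."""
    if max_freq <= 1:
        return 1.0  # assumption: avoids dividing by log(1) = 0
    return 0.5 + 0.5 * math.log(freq) / math.log(max_freq)


def term_weight(freq, max_freq, n_docs, doc_freq):
    """w(t, d) = 0.4 + 0.6 * tf(t, d) * idf(t)."""
    if freq == 0:
        return 0.4  # assumption: an absent term keeps only the constant part
    return 0.4 + 0.6 * tf(freq, max_freq) * idf(n_docs, doc_freq)


def similarity(query_terms, doc_term_freqs, n_docs, doc_freqs):
    """Average of the term weights over the query terms."""
    max_freq = max(doc_term_freqs.values())
    weights = [
        term_weight(doc_term_freqs.get(t, 0), max_freq, n_docs, doc_freqs[t])
        for t in query_terms
    ]
    return sum(weights) / len(weights)


# Toy numbers, chosen only to make the arithmetic easy to follow.
score = similarity(
    query_terms=["alimony", "court"],
    doc_term_freqs={"alimony": 4, "award": 1},
    n_docs=1000,
    doc_freqs={"alimony": 10, "court": 100},
)
print(round(score, 3))
```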



3.3 Ranking Key Numbers 

In order to display a list of candidate Key Numbers, we extract the Key Numbers 
assigned to the top documents in the search results. We then group the Key 
Numbers using a scoring function. We chose to experiment with the following
scoring functions: 

— the raw frequency of a Key Number, i.e. the number of retrieved documents 
with that Key Number. We expect this function to yield ties frequently. 

— the sum of the similarity scores of retrieved documents assigned a given Key 
Number, in an effort to eliminate ties. 

— the sum of rank weights. These functions give a higher weight to a Key
Number assigned to a document at the top of the retrieved set, and a lower
weight when the document is at a lower position. We consider two functions,
$w(r) = r^{-1}$ and $w(r) = 1 - \epsilon r$, where $\epsilon = 1/(k + 1)$, k being
the number of nearest neighbors.

For each test instance, the system sorts the Key Numbers by their scores and 
displays the ranked list of Key Numbers. 
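
The scoring functions can be compared on a toy retrieved set with the sketch below. It assumes the two rank weights are $w(r) = 1/r$ and $w(r) = 1 - \epsilon r$ with $\epsilon = 1/(k + 1)$; the similarity values are invented for illustration, and the Key Number 134k240 is a hypothetical label.

```python
from collections import defaultdict


def rank_key_numbers(neighbors, scoring="similarity"):
    """Rank Key Numbers found among the k nearest neighbors.

    neighbors: list of (key_number, similarity) pairs, best match first.
    scoring:   "frequency", "similarity", "inverse_rank" or "linear_rank".
    """
    k = len(neighbors)
    eps = 1.0 / (k + 1)
    scores = defaultdict(float)
    for rank, (key_number, sim) in enumerate(neighbors, start=1):
        if scoring == "frequency":
            scores[key_number] += 1.0
        elif scoring == "similarity":
            scores[key_number] += sim
        elif scoring == "inverse_rank":
            scores[key_number] += 1.0 / rank
        elif scoring == "linear_rank":
            scores[key_number] += 1.0 - eps * rank
        else:
            raise ValueError(f"unknown scoring function: {scoring}")
    # Highest score first; ties keep the order of first appearance.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented retrieved set with k = 4 neighbors.
neighbors = [
    ("134k235", 0.62),
    ("134k230", 0.60),
    ("134k230", 0.58),
    ("134k240", 0.41),
]
for name in ("frequency", "similarity", "inverse_rank", "linear_rank"):
    print(name, rank_key_numbers(neighbors, name)[0][0])
```

In this toy case only the inverse rank weight puts the Key Number of the single best match first, which illustrates how strongly that function favors the first retrieved document.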
4 The Corpus

4.1 Document Collections 

We constructed two collections of headnotes from the databases available at 
West. The first one, hnotes-all, contains all headnotes ever written and classified. 
The second one, hnotes-25, is a subset of hnotes-all and contains headnotes
written and classified in the past 25 years. As Table 1 indicates, hnotes-25
represents 42.81% of the headnotes in hnotes-all.



Table 1. Summary of the headnotes collections

Collection   Number of headnotes   Number of unique terms indexed   Total number of terms indexed
hnotes-all   20,481,882            308,112                          759,928,514
hnotes-25     8,767,630            188,516                          345,348,582



Although several Key Numbers can be assigned to a single headnote, 90% of 
the headnotes have a unique Key Number. When a headnote is assigned several 
Key Numbers, it becomes several training documents, each with a unique Key 
Number. Consequently, a simple 1-nearest neighbor approach is bound to fail. 

Also, topics and Key Numbers are not uniformly assigned to headnotes, as is
reflected by the noticeable discrepancy between the mean and median number
of headnotes per topic and per Key Number shown in Table 2.



Table 2. Summary of the distribution of headnotes per Key Number and topic

Topics
Collection   Median   Mean
hnotes-all   8,903    45,834.5
hnotes-25    2,956    21,661.1

Key Numbers
Collection   Median   Mean
hnotes-all   90       224.2
hnotes-25    23       121.9



4.2 Test Collection 

The test collection is a set of 200 queries (recent headnotes) selected by an
experienced editor. The selection was constrained to yield a representative sample
of headnotes whose classification was judged as easy or difficult. Easy and
difficult to classify headnotes represent 60 and 40 percent of the editor’s work
load, respectively. Thus, the queries were selected to reflect this ratio.

Each query went through an automatic process that identified legal phrases, 
removed stopwords and stemmed terms, in a way similar to the indexing pro- 
gram. There was a wide variation in the length of the resulting queries. The 
shortest was 4 terms (including phrases) long, the longest was 86 terms long. 
The average length of a query was 24 terms and phrases. 

5 Evaluation Strategy 

Our task is not a usual classification task as we are not required to output a 
unique class. As an application of classification techniques, we report the accu- 
racy of our classifier when only the top candidate is returned. 

To properly evaluate our classifier given our requirements, we report the 
percentage of times we found the correct Key Numbers within the top n positions 
of the ranked list; we used the values 2, 3, 4, 5 and 10 for n. 

An editor would consider a support system that suggests relevant Key Num- 
bers for a large proportion of headnotes more valuable than a system that is 
somewhat more accurate but that succeeds on fewer headnotes. Therefore, we 
also report the percentage of queries where we failed to retrieve the relevant Key 
Numbers. 
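
The two measures described above, the percentage of queries with a correct Key Number within the top n positions and the percentage of queries with no correct Key Number at any rank, can be sketched as follows. The data structures and the toy queries are assumptions made for illustration; a query counts as a hit if any of its correct Key Numbers appears in the inspected part of the list.

```python
def top_n_accuracy(ranked, correct, n):
    """Percentage of queries with a correct class among the first n candidates."""
    hits = sum(1 for q in ranked if set(ranked[q][:n]) & correct[q])
    return 100.0 * hits / len(ranked)


def failure_rate(ranked, correct):
    """Percentage of queries whose ranked list contains no correct class at all."""
    misses = sum(1 for q in ranked if not (set(ranked[q]) & correct[q]))
    return 100.0 * misses / len(ranked)


# Toy example: four queries, class labels A, B, C; the correct class is always A.
ranked = {
    "q1": ["A", "B", "C"],
    "q2": ["B", "A", "C"],
    "q3": ["C", "B", "A"],
    "q4": ["B", "C"],
}
correct = {q: {"A"} for q in ranked}

for n in (1, 2, 3):
    print(f"Top {n}: {top_n_accuracy(ranked, correct, n):.1f}%")
print(f"Failures: {failure_rate(ranked, correct):.1f}%")
```

The same functions apply at the topic level if each Key Number is first mapped to its topic.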

Even if our classifier cannot find the correct Key Number, it would be
considered useful (see Section 2.3) if the editor was pointed in the right
direction. To reflect this interest, we report all of the above measures at the
topic level, i.e. the first level in the Key Number hierarchy.



6 Results 

In our approach, only the number of neighbors and the scoring function have 
to be chosen empirically. We evaluated performance on both collections when 
k = 5, k = 10, k = 25, k = 50, k = 75, and k = 100. The scoring functions are
those from Section 3.3 for Key Numbers and topics.

Table 3 reports the percentage of queries for which the classifier failed to
retrieve the correct Key Number or topic at any rank. This percentage depends 
on k, but is independent of the scoring function. The lower the number of re- 
trieved documents, the lower the probability that the editor will find the relevant 
assignment. We observe but a negligible difference between the two collections. 
In addition, the good performance at the topic and Key Number level suggests 
that the system will be a useful support tool. 

Table 3. Percentage of queries for which the system failed to produce a correct
answer when the number of neighbors k varies.

Topics
Collection   k = 5   k = 10   k = 25   k = 50   k = 75   k = 100
hnotes-all   7.5%    6%       2.5%     2%       1.5%     1.5%
hnotes-25    7%      5%       2.5%     2.5%     1.5%     1.0%

Key Numbers
Collection   k = 5   k = 10   k = 25   k = 50   k = 75   k = 100
hnotes-all   31.5%   24.5%    18.0%    11.0%    9.5%     8.0%
hnotes-25    33.0%   23.5%    15.5%    12.0%    9%       8.5%

Table 4 reports, for both collections, the percentage of correct assignments
within the first n candidates at the topic and Key Number level using the sum
of similarity scores. We fixed k at 100, the best performing setting above. On
average, the rank weight $w(r) = 1 - \epsilon r$ performed slightly better (by 1
or 2%) than the sum of similarity scores for Top 1, Top 2 and Top 3. However,
there was no difference for Top 4, Top 5 and Top 10. The raw frequency was found
to be as effective as the sum of similarity scores, except in cases of ties (4 at
the Key Number level at Top 1) where a random choice picked out the incorrect
answer. The rank weight $w(r) = r^{-1}$ performed the worst: this function gives
too much importance to the first retrieved headnote. Once again, the difference
between the two collections was insignificant.



Table 4. Percentage of correct assignments after the first n Key Numbers or
topics. The number of neighbors k is set to 100. The scoring function is the sum
of similarity scores.

Topics
Collection   Top 1   Top 2   Top 3   Top 4   Top 5   Top 10
hnotes-all   80.5%   91.5%   94.5%   95.5%   96.0%   97.5%
hnotes-25    80.5%   91.5%   95.5%   96.5%   97.5%   98.5%

Key Numbers
Collection   Top 1   Top 2   Top 3   Top 4   Top 5   Top 10
hnotes-all   50.0%   63.5%   69.5%   77.0%   78.0%   83.5%
hnotes-25    48.5%   61.0%   67.5%   73.5%   76.0%   83.5%



The accuracy of the classifier is 50% at the Key Number level on hnotes-all 
and 48.5% on hnotes-25 (i.e. the percentage of correct answers at Top 1). We 
consider this result encouraging as the classifier has to pick among 82,000 Key
Numbers. 

Finally, we studied how our classifier performed relative to the difficulty of 
headnote classification. Remember that our support system will be more valuable 
if it can provide meaningful help with the difficult headnotes. Table 5 breaks
down the percentage of correct assignments at Top n between the easy and
difficult to classify queries. We present results on the hnotes-25 collection only
as results were similar for hnotes-all. While our classifier performed better on the 
easy queries than the difficult ones, the number of correct assignments for the 
difficult queries appears more than sufficient to be valuable to editors. 



Table 5. Percentage of correct assignments on the hnotes-25 collection, broken
down by ease of classification. The number of neighbors k is set to 100. The score
function is the sum of similarity scores.

Topics
Difficulty   Top 1   Top 2   Top 3   Top 4   Top 5   Top 10
easy         82.5%   94.2%   95.8%   97.5%   98.3%   99.2%
difficult    77.5%   87.5%   95.0%   95.0%   96.2%   97.5%

Key Numbers
Difficulty   Top 1   Top 2   Top 3   Top 4   Top 5   Top 10
easy         57.5%   70.0%   74.2%   80.8%   82.5%   88.3%
difficult    35.0%   47.5%   57.5%   62.5%   66.2%   76.2%




Although a test collection of 200 queries is not small in information retrieval, 
it may be considered small by classification and machine learning standards. 
Consequently, the results presented here are meant to be more suggestive than 
conclusive. 
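
The easy/difficult breakdown of Table 5 can be cross-checked against the overall hnotes-25 figures of Table 4. The sketch assumes that the 200 queries split exactly 60/40 between easy and difficult headnotes, as the description of the test collection suggests; the variable names are illustrative.

```python
# Key Number level, hnotes-25, Top 1, 2, 3, 4, 5 and 10 (values from Tables 4 and 5).
easy = [57.5, 70.0, 74.2, 80.8, 82.5, 88.3]
difficult = [35.0, 47.5, 57.5, 62.5, 66.2, 76.2]
overall = [48.5, 61.0, 67.5, 73.5, 76.0, 83.5]

# Weighted average under the assumed 60/40 split of the test collection.
combined = [0.6 * e + 0.4 * d for e, d in zip(easy, difficult)]

for label, c, o in zip(["Top 1", "Top 2", "Top 3", "Top 4", "Top 5", "Top 10"],
                       combined, overall):
    print(f"{label}: weighted {c:.2f}% vs. reported {o:.1f}%")
```

The weighted averages agree with the reported overall percentages up to the rounding of the table entries.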



7 Discussion 

With the help of expert classifiers, we conducted a failure analysis of the 17 
queries for which the system failed to retrieve the correct Key Number. For 
10 out of these 17 queries, the system retrieved a Key Number related to the 
correct Key Number within the 5 top ranked candidates. In two of the test 
cases, headnotes had been assigned 2 Key Numbers manually. The k-nearest
neighbor approach retrieved the other Key Number in the top 5 candidates. 
Three queries were found to be incorrectly formatted. To sum up, the analysis 
showed that there were only 4 queries that could be considered true failures. 

It has been conjectured that the more documents there are, the better the 
results will be [?]. However, using all 20 million headnotes did not improve
performance over using 8.7 million headnotes. For each query, we compared 
the headnotes retrieved in both collections: 78% of the headnotes retrieved in 
hnotes-25 also appeared in the retrieved set from hnotes-all. Overall, 81.5% of 
the retrieved headnotes from hnotes-all were written in the past 25 years (the 
additional 3.5% results from a difference in term weights in the collections). 

Typical values for k in previous research range from 1 to 15. [?] reported
that a larger value of k (30 in her experiments) may be more suitable when
documents can appear multiple times in the training collection. We were able
to use a larger number for k because of some characteristics of our training 
collection: its size and the average number of headnotes per Key Number. 

We believe that our system succeeded in proposing relevant Key Numbers 
since NL-Westlaw can choose from a very large number of headnotes. Further- 
more, we suspect that NL-Westlaw is successful because the queries and the 
documents are of the same length and style, which is unusual in information 
retrieval applications. 

The performance of the classifier leads us to believe that we are justified in 
going forward to deploy the support system in a production environment. The 
percentage of the queries with the correct Key Numbers in Top 10 is especially 
encouraging. The system found relevant topics and Key Numbers for the dif- 
ficult to classify headnotes almost at the same level as for the easy to classify 
headnotes. 

7.1 Related Work 

Nearest neighbor classification (see [?] for an overview), also referred to as
instance-based learning [?] or Memory-Based Reasoning [?], has been widely
studied. Three papers reported applying a nearest neighbor approach to textual
data. They used a full text retrieval engine to assess similarity between
documents to be classified and previously classified documents. Masand et al. [?]
first used Memory-Based Reasoning to classify news stories. Yang’s ExpNet [?] is a
nearest neighbor method applied to documents extracted from the MEDLINE
database. Larkey and Croft [?] used a nearest neighbor classifier in combination
with other classifiers to assign codes to inpatient discharge summaries. Cohen’s
WHIRL system can also be viewed as a k-nearest neighbor technique when used
for classification [?]. WHIRL extends the join operator for relational databases
to handle textual fields of varying length, and has been applied to data collected
from the Web.

Although we adopted a similar approach, we differ in terms of the size of
the training collection (20 and 8.7 million documents) and the number of classes
(82,000 Key Numbers), several orders of magnitude larger than the largest
collection reported: Masand et al. [?] used a collection of around 50,000 news
stories and 350 classes, while Larkey and Croft [?] used 3,261 codes and about
11,600 documents.

It is hazardous to compare two systems on two distinct problems. However, 
we find the performance of our classifier encouraging, when compared to that
of Larkey and Croft [?]. At the full code level (similar to our Key Number level),
their k-nearest neighbor classifier found the correct code for 38.5% of the test
collection at Top
1 and 72.2% at Top 10; at their higher level the performance was 55.1% at Top 
1 and 84.1% at Top 10. 

8 Conclusion 

We showed that with a minimum effort, using a full text retrieval engine, our 
Memory-Based Reasoning approach is successful at suggesting candidate Key 
Numbers or hierarchically defined ranges of Key Numbers (here the top most
informative node in the hierarchy, the topic). While we are confident that the 
performance of the classifier will increase the efficiency of the classification pro- 
cess, work remains to be done to integrate this system into the overall editorial 
environment. 

We used two training collections in this experiment. The first consisted of 20
million headnotes (hnotes-all), the entire collection of abstracts. The second con- 
sisted of only those abstracts produced in the last 25 years (hnotes-25). We were 
somewhat surprised to discover that the accuracy of the candidate Key Num- 
bers produced by the system was as great with the smaller collection as with the 
larger. 

Our experiments showed that the Memory-Based Reasoning system worked 
best when the number of neighbors (k) was more than 50. This confirms our 
hypothesis that the nature of our data requires a larger value for k than is 
usually reported in the literature. 

In the future, we first plan to further reduce the size of the training collection 
to include only headnotes specific to the jurisdiction of the query. We believe 
that this would result in retrieval of even more similar headnotes. Another ad- 
vantage would be to reduce the processing time. Then we will investigate query 
modification, especially reducing the length of the queries based on some sta- 
tistical evidence. This will also allow us to reduce computation, as long queries 
require more processing. Finally we need to assess whether the system (with its 
graphical interface) will help increase classification consistency and accuracy of 
new or legacy headnotes. Additionally, a similar approach could be adopted in 
suggesting Key Numbers to be used for query enhancement for on-line customer 
queries. 
Abstract. Today, there is a growing need to replace or maintain plants
built in the 1960s. However, this is difficult because the documentation
of the installed sequential control logic has seldom survived.

Therefore, we propose an automatic regeneration method (SPAIR) to solve
this problem. SPAIR regenerates sequential control logic, expressed as a
ladder diagram, from the input and output data of a target control unit
and its supplementary specifications, which indicate the information
about timers, etc. SPAIR consists of two parts, namely the basic logic
inferring engine and the interior coil logic inferring engine. In the
basic logic inferring engine, time series data is compressed and
translated into training data using the specifications. The training data
are processed by inductive learning and transformed into control logic.
In the interior coil logic inferring engine, the target logic of the
interior coil is acquired by the selective attachment of logic parts.



1 Introduction 

To replace sequential control systems that are made up of vacuum tubes or
relay circuits, the control logic that is installed in the control systems should
be transformed into the form of a ladder diagram for modern programmable logic
controllers (PLCs). However, documents of the control logic often no longer
exist, or most of the remaining documents are incorrect because of modifications
of the control logic. For this reason, it is necessary to automatically extract the
currently installed control logic from the information about the target logic that
we can observe, for example, input-output data and other action specifications.

Even though methods for the automatic programming of sequential control 
programs have been developedp^ |2j, they require complete and exact models and 
plant specifications. Therefore, these methods are not suitable for regenerating 
sequential control programs for operating plants. 

We have therefore proposed a method for automatically regenerating sequence
programs for operating plants: the Sequential control Program Automatic Inductive
Regeneration (SPAIR) method 0. SPAIR regenerates sequence programs by applying
inductive learning to the input-output data of a controller (time series data) and
action specifications.

D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 449-|4^2l 1999-
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Fig. 1. SPAIR Outline.

SPAIR consists of two engines: a basic logic inferring engine and an interior
coil logic inferring engine. In the basic logic inferring engine, training data
is needed in order to apply inductive learning. Therefore, the target control logic is
divided into “set” and “reset” conditions for each output signal. We have defined
a training data format for learning each condition.

In the regeneration of logic, interior coils cause some problems. Interior coils
are logic variables that do not appear in the output data, and hence do not
appear in the time series data; they are used, for example, in
recognizing the action mode. If the target logic contains interior coils,
the basic method cannot regenerate it. Therefore, we proposed an efficient
inference method for interior coil logic, called the Logic Parts Attaching Algorithm.
In this method, the target logic of an interior coil is acquired by repeated selective
attachment of logic parts, which are contacts in a ladder diagram. This
method has two phases: the attaching phase and the selecting
phase. In the attaching phase, some logic parts are attached to the current tentative
logic in order to generate several candidate logics. In the selecting phase, the most
promising candidate is selected as the next tentative logic.



2 Outline of SPAIR 

SPAIR regenerates the logic of a target plant. An outline of SPAIR is shown in Fig. 1.
The inputs of SPAIR are time series data patterns (input-output data) from the
target plant and supplementary action specifications of the target plant, such
as information about timers. In SPAIR, the basic logic inferring engine is applied
first. Then, if the target logic has interior coils, the interior coil logic inferring
engine is applied.
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3 Basic Logic Inferring Method 

The composition of the basic logic inferring engine of SPAIR is shown in Fig. 2.
The basic logic inferring algorithm consists of four phases: preprocessing, training
data generating, inductive learning, and ladder transforming. Each phase is
described in detail below.

In the preprocessing phase, when the same data pattern continues in the time
series data accumulated from the controller, the repeated samples are compressed
into one and the duration of the run is added. The reduced data is called arranged
time series data. The purpose of this function is to limit the quantity of data and to
reduce the time needed for inductive learning. The arranged time series data is
generated as the time series data is accumulated from the target plant.
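
The compression step can be illustrated with a minimal sketch. It assumes each sample is a tuple of signal values and that the added duration is simply the number of consecutive identical samples; the function name and the sample data are hypothetical and not taken from SPAIR itself.

```python
from itertools import groupby


def arrange_time_series(samples):
    """Collapse each run of identical samples into (pattern, duration).

    Duration is counted in sampling steps, which is an assumption here.
    """
    arranged = []
    for pattern, run in groupby(samples):
        arranged.append((pattern, sum(1 for _ in run)))
    return arranged


# Hypothetical samples: (input X1, output Y1) at successive sampling steps.
samples = [(0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 1), (0, 0)]
arranged = arrange_time_series(samples)
```

Seven raw samples reduce to four arranged entries in this example, which is the kind of reduction the preprocessing phase relies on to shorten inductive learning.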

In the training data generating phase, the training data sets for each target
output logic are generated from the arranged time series data. A training data
set consists of some attributes and a class. The training data set is applied to
ID3, which is a typical inductive learning algorithm 0. When the value of the target
output changes, particular inputs and outputs usually change at the
current time of the time series data. Therefore, the attributes of the training
data set are made up of the current time data, and the class of the training data is
determined by the target output at the current time and at the previous time.

The class of the training data set is one of four classes: “set”, “reset”,
“1-continue” and “0-continue”. If the target output at the current time is 1 and
the previous output is 0, the class of the training data set is “set”. Similarly, if
the previous output is 1 and the current output is 0, the class is “reset”. If both
the previous output and the current output are 1 or 0, the class is “1-continue”
or “0-continue,” respectively.
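
The four-class labelling can be sketched as follows, taking “set” as a 0-to-1 change and “reset” as a 1-to-0 change of the target output. The row format (a dict of signal names to 0/1) is hypothetical, and excluding the target output from the attributes is an assumption, since the text does not say whether it is included.

```python
def label_transition(previous, current):
    """Return the class for one target output given its previous and current value."""
    if previous == 0 and current == 1:
        return "set"
    if previous == 1 and current == 0:
        return "reset"
    return "1-continue" if current == 1 else "0-continue"


def make_training_data(rows, target):
    """Build (attributes, class) pairs from consecutive arranged rows.

    Attributes are the current-time signals; the target itself is left out
    (an assumption made for this sketch).
    """
    data = []
    for prev, cur in zip(rows, rows[1:]):
        attributes = {name: value for name, value in cur.items() if name != target}
        data.append((attributes, label_transition(prev[target], cur[target])))
    return data


# Hypothetical arranged rows with one input X1 and one output Y1.
rows = [
    {"X1": 0, "Y1": 0},
    {"X1": 1, "Y1": 1},
    {"X1": 1, "Y1": 1},
    {"X1": 0, "Y1": 0},
]
training = make_training_data(rows, "Y1")
```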




Fig. 2. The Composition of the Basic Logic Inferring Engine. 
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In the inductive learning phase, decision trees are made from the training
data set by using ID3. The decision tree can determine which class a data item
whose class is unknown belongs to. The set and reset conditions are acquired
separately, because they are independent of each other. The training data sets
whose class is “set” or “0-continue” are used to find the set condition.
The reset condition is learned from the training data sets whose class is “reset”
or “1-continue”.

In the ladder transforming phase, the decision trees are converted into control 
logic in the form of a ladder diagram, which we can easily understand. Finally, 
when all of the generated logic of each output is merged, a completed control 
logic in a ladder diagram is regenerated. 
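
One way to picture the ladder transforming phase is to read each root-to-leaf path of a decision tree as a series (AND) connection of contacts and to connect the paths for the wanted class in parallel (OR). The sketch below only illustrates that idea: the tuple encoding of the tree and the signal names X1 and X2 are hypothetical, and the actual conversion used in SPAIR is not described in the text.

```python
def paths_to_class(tree, wanted, prefix=()):
    """Collect the paths that end in the wanted class.

    A tree is either a class label (str) or a tuple
    (signal, subtree_if_0, subtree_if_1).
    """
    if isinstance(tree, str):
        return [prefix] if tree == wanted else []
    signal, if_zero, if_one = tree
    return (paths_to_class(if_zero, wanted, prefix + ((signal, 0),))
            + paths_to_class(if_one, wanted, prefix + ((signal, 1),)))


def as_expression(paths):
    """Render paths as OR of AND terms; NOT marks a normally closed contact."""
    rungs = []
    for path in paths:
        contacts = [name if value else f"NOT {name}" for name, value in path]
        rungs.append(" AND ".join(contacts) or "TRUE")
    return " OR ".join(f"({rung})" for rung in rungs)


# Hypothetical tree for a set condition: set when X1 is on and X2 is off.
tree = ("X1", "0-continue", ("X2", "set", "0-continue"))
expression = as_expression(paths_to_class(tree, "set"))
```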

4 Interior Coil Logic Inferring Method 

4.1 Problem of Interior Coil 

When control logic contains interior coils, the state of an interior coil can
change an output signal, as shown in Fig. 3. In the figure, the signal of M
does not appear in the time series data, because M is an interior coil. As a result, the
training data made from the time series data includes “conflict data,” which
have the same attributes but different classes. Control logic that contains
interior coils cannot be acquired by inductive learning because of the conflict
data. Therefore, when conflict data is contained in the training data, it is
necessary to infer the interior coil logic and add attributes for the interior coil
to the training data.
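
Conflict data as defined above can be detected by grouping training examples on their attributes and checking for more than one class. This is a minimal sketch under the assumption that each example is an (attributes, class) pair with a dict of signal values; the example data is hypothetical.

```python
from collections import defaultdict


def find_conflicts(training_data):
    """Return the attribute patterns that occur with more than one class."""
    classes_by_attributes = defaultdict(set)
    for attributes, label in training_data:
        key = tuple(sorted(attributes.items()))
        classes_by_attributes[key].add(label)
    return [dict(key) for key, labels in classes_by_attributes.items()
            if len(labels) > 1]


# Hypothetical examples: X1 = 1 is seen with two different classes.
training_data = [
    ({"X1": 1}, "set"),
    ({"X1": 1}, "0-continue"),
    ({"X1": 0}, "0-continue"),
]
conflicts = find_conflicts(training_data)
```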

There are two problems in the inference of interior coils. First, it is unknown when
the interior coil is set or reset. Second, it is impossible to acquire the set and reset
conditions of the interior coil independently, because they depend on each other.
Therefore, SPAIR infers interior coil logic in the following way. According to the
target output, conflict data is divided into “set conflict data” (target output is
“1”) and “reset conflict data” (target output is “0”), as shown in Fig. 3. Interior
coil logic can be represented by a combination of set and reset conditions that
satisfies the following two constraints:

Going upstream on time series data from set conflict data, data that satisfies 
set condition appears earlier than data that satisfies reset condition 

Going upstream on time series data from reset conflict data, data that satis- 
fies reset condition appears earlier than data that satisfies set condition 

Interior coil logic could then be acquired by searching all combinations of
conditions, but this requires much time and space.



4.2 Approach for Inferring 

To infer interior coil logic efficiently, we proposed an interior coil inference
method called the Logic Parts Attaching Algorithm 0. In this method, interior
coil logic is made by repeatedly and selectively attaching logic parts, as shown in
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Fig. 3. Problem of Interior Coils. 



Fig. 4. This method acquires interior coil logic by repeatedly applying two phases,
the attaching phase and the selecting phase.

In the attaching phase, some candidates for interior coil logic, called “candidate
logic”, are made. If every possible candidate logic were made, the number of candidates
would be enormous and the inference would be inefficient.

In the selecting phase, if a logic that can classify all of the conflict data
exists among the candidate logic, that logic is the interior coil logic. Otherwise, the
most promising candidate logic, the one closest to the interior coil logic, is selected,
and logic parts are attached to it again in the next attaching phase. The selected logic
is called “tentative logic”. We define the distance to the interior coil logic as the
accuracy with which a candidate logic classifies the conflict data.
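
The alternation of the two phases can be sketched as a greedy loop. Everything about the interface here is an assumption made for illustration: `attach`, `score` and `is_complete` are hypothetical callbacks standing in for the attaching step, the distance measure and the test for classifying all conflict data, and `max_rounds` is an added safeguard that the text does not mention.

```python
def infer_interior_coil(initial, attach, score, is_complete, max_rounds=20):
    """Greedy search: attach parts, stop on a complete logic, else keep the best.

    attach(logic)      -> list of candidate logics (attaching phase)
    is_complete(logic) -> True if the logic classifies all conflict data
    score(logic)       -> larger means closer to the interior coil logic
    """
    tentative = initial
    for _ in range(max_rounds):
        candidates = attach(tentative)
        if not candidates:
            return None
        for candidate in candidates:
            if is_complete(candidate):
                return candidate
        # Selecting phase: the most promising candidate becomes tentative logic.
        tentative = max(candidates, key=score)
    return None


# Toy stand-in: a "logic" is an integer and the target logic is 5.
found = infer_interior_coil(
    initial=0,
    attach=lambda n: [n + 1, n + 2],
    score=lambda n: -abs(5 - n),
    is_complete=lambda n: n == 5,
)
```

Because only one tentative logic is kept per round, the search examines a small number of candidates in each round rather than all combinations of conditions.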

Two problems then remain. One is how to define the
distance to interior coil logic in the selecting phase; the other is how to judge whether
a logic can become the interior coil logic. We propose a definition of the distance to




Fig. 4. Logic Parts Attaching Algorithm. 
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Fig. 5. Classify Blocks to Set or Reset Group. 



interior coil logic and a method of selecting tentative logic in Section 4.3, and a judgment
of which candidate logic has the possibility of becoming interior coil logic, together with
the method of attachment, in Section 4.4.

4.3 Selecting Tentative Logic 

In this method, the candidate logic that is closest to the interior coil logic
needs to be selected as the next tentative logic. We define the distance
to interior coil logic by the degree to which a candidate logic
correctly classifies the conflict data. In judging the distance, the signal of the
interior coil cannot be obtained from the output of the control unit. Therefore, it is
necessary to assume the signal. Additionally, searching for the logic over the whole of
the time series data is computationally expensive. Therefore, the time
series data is divided into blocks before starting the inference.

First, it is assumed that the signal of the interior coil is “1” at set conflict
data and “0” at reset conflict data. It does not matter if this
assumption is wrong, because the logic obtained by exchanging the set condition and
the reset condition is logically the same as the actual target interior coil logic.

The interior coil is set in the period from reset conflict data to set conflict data,
and reset in the period from set conflict data to reset conflict data. In other
words, the interior coil logic is related to the time series data on the downside of the
conflict data. Therefore, the time series data is divided into blocks between the
conflict data. Each block is one of the following two types, as shown in Fig. 5:

Set block: Block that includes set conflict data. 

Reset block: Block that includes reset conflict data. 

Each block is also classified into one of the following two groups by applying the
conditions of candidate logic, as shown in Fig. 5:

Set group: Going upstream on the time series data from the conflict data, 
the data that satisfies the set condition of candidate logic appears earlier than 
the data that satisfies the reset condition. 

Reset group: Going upstream on the time series data from the conflict data, 
the data that satisfies the reset condition of candidate logic appears earlier than 
the data that satisfies the set condition. 
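
The two group definitions amount to walking backwards from the conflict data and reporting which condition is met first. The sketch below assumes a block is a time-ordered list of rows that ends just before the conflict data, that conditions are plain predicates on a row, and that the set condition takes priority when one row satisfies both; none of these details are specified in the text.

```python
def classify_block(block, set_condition, reset_condition):
    """Return "set" or "reset" for the first condition met going upstream.

    Returns None when neither condition is met anywhere in the block.
    """
    for row in reversed(block):
        if set_condition(row):
            return "set"
        if reset_condition(row):
            return "reset"
    return None


# Hypothetical candidate logic: set when A is on, reset when B is on.
def set_condition(row):
    return row["A"] == 1


def reset_condition(row):
    return row["B"] == 1


block = [
    {"A": 1, "B": 0},
    {"A": 0, "B": 0},
    {"A": 0, "B": 1},
    {"A": 0, "B": 0},
]
group = classify_block(block, set_condition, reset_condition)
```

Here the row with B on is nearer to the conflict data than the row with A on, so the block falls into the reset group.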
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In a set block, the interior coil is set at the set conflict data, so data that
satisfies the reset condition does not appear after data that satisfies the set
condition. That is, all of the set blocks should be classified into the set group
by the interior coil logic. Similarly, all of the reset blocks should be classified into
the reset group.

Second, the distance to interior coil logic is defined as follows.
The distance is determined by the degree to which the blocks
of the time series data are correctly classified. Correct classification means that
a set block is classified into the set group and a reset block into the reset group.
When a candidate logic classifies all blocks into the correct
groups, that candidate logic is the interior coil logic.

When no candidate logic classifies all blocks correctly, the candidate
logic that classifies the blocks best is selected as the new tentative
logic. The concept of entropy is introduced as a quantitative measure
of classification correctness. We define the most promising candidate logic
as the candidate logic whose entropy gain is the largest.

As shown in Fig. 6, the entropy gain is calculated for the result of classifying
the blocks by each candidate logic. Then, the candidate logic whose entropy gain is
the largest is selected as the next tentative logic. In the figure, candidate
logic 2 has the largest entropy gain; therefore, candidate logic 2 is selected
as the next tentative logic.
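
The text does not give the formula for the entropy gain. A plausible reading, assumed here, is the information gain used in ID3: the entropy of the block types (set block or reset block) minus the weighted entropy of the block types within each group produced by the candidate logic. The block data below is hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def entropy_gain(block_types, groups):
    """Information gain of the grouping with respect to the block types."""
    total = len(block_types)
    gain = entropy(block_types)
    for group in set(groups):
        members = [t for t, g in zip(block_types, groups) if g == group]
        gain -= len(members) / total * entropy(members)
    return gain


# Hypothetical blocks: two set blocks and two reset blocks.
block_types = ["set", "set", "reset", "reset"]
groups_candidate_1 = ["set", "reset", "set", "reset"]   # uninformative split
groups_candidate_2 = ["set", "set", "reset", "reset"]   # perfect split

gain_1 = entropy_gain(block_types, groups_candidate_1)
gain_2 = entropy_gain(block_types, groups_candidate_2)
```

Under this reading a grouping that is exactly inverted also has maximal gain, which fits the earlier remark that exchanging the set and reset conditions gives logically the same interior coil logic.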

4.4 Attaching Logic Parts 

For the tentative logic to become the interior coil logic by attaching
a new logic part, the resulting candidate logic must correctly classify
conflict data that the tentative logic cannot. In an attaching process, candidate
logic that can improve on the current tentative logic should be made for an effective
inference.

In Fig. 7, block E is a block that is classified into the incorrect group, called an
“error block”, because block E is a set block but is classified into the reset group. The
reason block E is classified into the reset group is that data satisfying the
reset condition appears earlier than data satisfying the set condition when going
upstream on the time series data from the set conflict data in block E. For block E
to be classified into the set group, data satisfying the set condition must appear
earlier than data satisfying the reset condition.

Therefore, we propose the following method. Attaching logic parts to the reset
condition in series (AND connection) causes data that satisfied the reset
condition to no longer satisfy it. This reduces the probability
of classifying blocks into the reset group. Similarly, attaching logic parts to the set
condition in parallel (OR connection) causes data that did not satisfy the set
condition to satisfy it. This raises the probability of classifying blocks
into the set group, because the set condition may be satisfied again after the reset
condition is satisfied. Further, these operations raise the number of blocks that
are classified into the set group, and blocks that are already classified into the set
group are not affected. Similarly, if reset blocks are classified into the set group,
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it is necessary to attach logic parts to the set condition in series or to the reset
condition in parallel.
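
The two attachment operations can be sketched by writing a condition as a list of AND terms that are connected by OR. Attaching a part in series (AND connection) adds it to every term, and attaching a part in parallel (OR connection, "in rows" in the text) adds a new term. The representation, the function names and the signals A, B and C are hypothetical.

```python
def attach_in_series(condition, part):
    """AND the part into every term, making the condition harder to satisfy."""
    return [term | {part} for term in condition]


def attach_in_parallel(condition, part):
    """OR the part in as a new term, making the condition easier to satisfy."""
    return condition + [frozenset({part})]


def holds(condition, row):
    """True if any term has all of its (signal, value) contacts matched by row."""
    return any(all(row[s] == v for s, v in term) for term in condition)


def candidates_for_set_error(set_condition, reset_condition, parts):
    """Candidates for a set block that was wrongly placed in the reset group."""
    candidates = []
    for part in parts:
        candidates.append((attach_in_parallel(set_condition, part), reset_condition))
        candidates.append((set_condition, attach_in_series(reset_condition, part)))
    return candidates


# Hypothetical tentative logic: set on A, reset on B; the new part is C on.
set_c = [frozenset({("A", 1)})]
reset_c = [frozenset({("B", 1)})]
cands = candidates_for_set_error(set_c, reset_c, [("C", 1)])
```

For a reset block wrongly placed in the set group, the roles are swapped: the set condition is extended in series or the reset condition in parallel.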



5 Application Result 

In order to confirm the effectiveness of SPAIR, we applied it to an automatic
well model, as shown in Figure 8. The installed control logic is shown in Figure
9. This logic has one interior coil. “Mode 1 Switch” (S1), “Mode 2 Switch” (S2)
and “Tank B Valve” (Vb) turn on and off at random. “Drug Tank Valve” (Vd)
and “Mixer” (Mi) are turned off automatically after a given amount of time has
passed.

Time series data of 10,000 steps were accumulated from this model. They were
reduced to 1,500 arranged time series data during preprocessing. The basic logic
inferring method and the interior coil logic inferring method of SPAIR were applied
to these data. The resulting generated control logic is shown in Figure 10.

Comparing the control logic inferred by this system with the installed
control logic, yi D ye are equal except for logic that was not exercised
while the time series data was being accumulated.

We also applied the proposed method to two other models (one of them is a
real conveyor plant), and confirmed the effectiveness of the method.
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Fig. 9. Installed control logic. 
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6 Conclusion 

In this paper, we have presented the SPAIR method, which consists of two
methods, a basic logic inferring method and an interior coil logic inferring method, for
regenerating the logic of operating plants. We made a sample sequential model
and simulated it in order to verify the control logic inferring algorithm. The
simulation showed that the control logic is inferred.

However, some problems remain to be solved before SPAIR can be applied in
practice. First, emergency logic inference is necessary. In time series data, the signals
produced by emergency logic rarely appear. Therefore, we have to develop a method for
inferring emergency logic from data other than time series data, and combine the result
with the logic inferred by normal SPAIR.

In interior coil inference, when an interior coil refers to another interior coil,
the current inferring method cannot infer the logic. In some cases, repeated
application of the current method can infer it, but not when
those coils refer to each other. Therefore, an inference method for
interrelated interior coils is needed.
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Abstract. Field evaluation of AI systems or software systems in general 
is a challenging research topic. During the last few years we have devel- 
oped a software-based eye screening system. In this paper we describe our 
work on evaluating several important aspects of the system. We have sys- 
tematically studied the key issues involved in evaluating software quality 
and carried out the evaluations using different strategies. After a brief 
introduction of the system, this work is described from a data-analysis 
problem-solving perspective, involving problem analysis, data collection, 
and data analysis. 



1 Introduction 

A sensible evaluation of AI systems or software systems in general has always 
been a challenging research issue p I IblDj . A careful assessment of such systems 
in laboratory environments is important but is no substitute for testing them in
the real-world environments for which they are developed. This is especially important
in medical informatics applications where their use in clinical situations is
vital 1 1 b) . 

In the last few years, we have developed a software-based visual field screen- 
ing system that integrates a visual stimuli generating programme with a number 
of machine learning components m- This system was developed in response to 
the practical need to screen subjects in various public environments where the 
specialised instruments for examining the visual field cannot be made available. 
In particular, the system was designed to detect glaucoma and optic neuritis 
effectively. It was based on the Computer Controlled Video Perimetry (CCVP) 
m the first visual stimuli generating programme implemented on portable 
PCs. CCVP had demonstrated some early success in detecting visual field dam- 
age, especially under certain controlled test environments P3C3!. In particular, 
CCVP introduces to the test subject various stimuli at predetermined locations 
in the visual field and obtains repeated measurements over these locations. The 
subject’s response to each stimulus is recorded as having recognised, or hav- 
ing failed to recognise the stimulus. The results are subsequently analysed to 
determine the possibility of eye disease. 

D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 461-|4_^ 1999. 
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One of the extensions to the CCVP programme is the development of a 
user-friendly interface. One key consideration in the development of this inter- 
face was how to handle the problem of human behavioural instability during 
the test so that the subjects would make fewer mistakes (false positive or false 
negative responses). As learning, inattention and fatigue are among the major 
behavioural factors, various measures have been taken to address these issues in 
the interface development. These include a feedback system to indicate the sub- 
ject’s performance using sound and text, the design of interesting test stimuli, 
and customised test strategies for individuals. Moreover, attempts were made 
to develop an adaptive interface using machine learning techniques where the 
number of repeated measurements from each individual may vary El 

To deal with the possibility of collecting unreliable data, an on-line neural 
network “stability analyser” component was adopted to clean the data, to judge 
whether the current test is stable and whether an immediate follow-up test should 
be conducted on the subject |^. This turns out to be an important addition to the 
system capability since major field studies are typically expensive, and therefore 
not conducted often. It is important to collect reliable data from the subjects in 
the investigation under consideration. 

To evaluate this screening system, various characteristics affecting software 
quality were systematically studied and clinical data collected from laboratory- 
based and field-based investigations in different communities were carefully anal- 
ysed. This paper examines this evaluation effort, describes it from a data-analysis 
problem-solving perspective, and discusses what has been learned. 



2 Problem Analysis: Evaluation Issues 

Understanding and formulating an analysis task is the first step in addressing
the problem. In this context, various factors affecting the application development
are analysed, including the key requirements from the application, human
and organisational constraints, what data should be collected, and possible legal 
implications. The nature of the problem solving task is defined. It was found that 
problem formulation is one of the most challenging parts of the data analysis 
process, which has yet to receive sufficient attention 0 . 

Much research has been carried out on how to evaluate software systems HEl. 
In particular, there is an ISO International Standard (ISO/IEC 9126) which de- 
fines and details various software evaluation characteristics, including functional- 
ity, reliability, usability, efficiency, maintainability, and portability [B|. The space 
limit here prevents us from giving a detailed analysis of all the characteristics 
for our application, so we shall focus on functionality, reliability, and efficiency. 
These are among the most important characteristics for screening applications. 

Functionality: This refers to a set of functions that satisfy stated
needs for an application 0 . For the screening application we want the system
to detect as many as possible of those subjects in the community
who suffer from an eye disease at an early stage and, at the same time, to minimise
the number of “false positives” - those who failed the test, but have no eye
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disease. Note that “in the community” implies that it is of crucial importance 
that the system be tested in different public environments. 

Reliability: This is defined as the capability of software to maintain its level 
of performance under stated conditions for a stated period of time 0. One of 
the most important criteria for screening applications is how reliably the data 
collected by the system reflect a subject’s visual function or damage. We have
proposed two criteria for measuring such reliability: 1) the consistency between 
the repeated test results from the same subject; 2) the agreement between disease 
patterns discovered from our test and those from other established screening 
instruments. 

Efficiency: Efficiency is concerned with the relationship between the level 
of performance of the software and the amount of resources used 0. In the 
screening context, this is about how to minimise the amount of time a subject 
has to spend on a single test visit, while maintaining the quality of the test 
results. Since immediate follow-up tests may be recommended to obtain reliable 
test results during a single visit (see section 1), the following two questions are 
worth asking: 1) What is the minimum number of repeated measurements during 
a test to maintain the quality of test results? 2) If there is a need for on-line 
follow-up tests, what is the minimum number of follow-ups for subjects? 

Apart from a detailed examination of what should be evaluated, one should 
also consider other issues regarding what needs to be done in order to allow for an 
effective evaluation. For example, how should we select the target population? 
The answer to this question depends very much on what kind of disease one 
aims to screen for. If one tries to screen for glaucoma, the second largest cause 
of blindness in the developed world that affects one-in-thirty people over the 
age of 40, then it probably makes sense to set up such a test in GP clinics and 
to screen mainly those over 40. On the other hand, if one aims to screen for 
optic neuritis, the most common optic-nerve disease affecting young people, it is 
desirable to set up such a screening test in African or Central/South American 
countries where this disease is particularly common. 

Naturally what kind of data are to be collected and the size of target popu- 
lations should be important concerns. These are closely related to the particular 
operating constraints imposed by the corresponding investigation, and in screen- 
ing applications, it often means taking as large a sample as time or cost would 
allow. Amongst the data collected from subjects are the following two items: 
subject’s responses to the repeated measurements, and subject’s response time 
(time taken from a stimulus displayed on the screen to the moment the subject 
responds). 

Another important concern would be the possible legal implications should a 
patient sue a doctor who had access to the screening system. This is a complex 
issue and several guidelines are listed in US! to avoid negligence claims, includ- 
ing that the system has been carefully evaluated in laboratory conditions, the 
system provided its user with explanations or the opportunity to participate in 
the decision-making process, and no misleading claims are made regarding the 
capability of the system. 
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3 Data Collection 

The screening system has been used in both hospital-based and field-based
investigations. In this section we briefly describe the settings of two major field
studies. The first is the World Health Organisation programme for preventing 
optic neuritis in the Kaduna State, Nigeria j2|. Kaduna was chosen since this is 
an area particularly known for being endemic for optic neuritis. The exact cause 
for this disease is still unknown, but the symptoms are blurred central vision, 
reduced colour vision, and reduced sensation of light brightness. The other is 
a pilot study to detect people with glaucoma, a common disease with the el- 
derly in the UK, sponsored by the Medical Research Council m Glaucoma is 
a condition, sometimes associated with high pressure in the eye, that over many 
years can damage the retinal nerve fibres at the back of the eye. If left untreated, 
glaucoma can lead to complete blindness. 

In the optic-neuritis study, the subjects were from a farming community in 
Kaduna, who were largely computer-illiterate. The visual field tests were car- 
ried out in village huts on consenting subjects aged 15 years and over in several 
rural communities that were endemic for optic neuritis in the guinea savannah 
of Kaduna State, Northern Nigeria. These tests were conducted on a random 
sample of all those subjects that had failed some standard visual function tests. 
In addition, a population which was not endemic for optic neuritis with similar 
ethnic, cultural, educational and geographic backgrounds to the endemic pop- 
ulation was examined as a control population. In all, 3182 test records from 
2388 different eyes were collected using six notebook computers operated by 
ophthalmic nurses. 

In the glaucoma study, the test was offered during routine attendance at a 
large urban general practice in North London and was conducted in a corner of 
the main waiting room separated by a cotton screen. Although the test is conducted
by patients themselves, a nurse was on site to communicate with patients
before or after the test, e.g. inviting the patient, conducting a questionnaire, and
obtaining a hard copy of the test results. For a three-month period during the
pilot study, all patients aged 40 or over who routinely attended the practice were 
offered the test. Upon entering the clinic, each patient was given an information 
sheet explaining the purpose of the pilot study, the nature of glaucoma and the 
visual field test, what to expect during and after the test, and information about 
whom to contact if they wished to know more about the test in general or were 
concerned about their own results. A consent form for taking the test was then 
signed by each interested patient. Patients with known glaucoma were excluded 
from the study, as were those too ill or incapable of completing the test. More
than 900 people were screened and over 2000 test records were collected during 
the screening period. 

In each of these two studies, there were a number of subjects who were 
subsequently invited back to undertake a thorough re-examination by ophthal- 
mologists where various other tests were conducted to determine the nature of 
disease. The test data corresponding to these subjects were then used to evaluate 
the screening system. 
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4 Data Analysis 

This section describes the methods and results of evaluating the screening system
with respect to the following software characteristics: functionality, reliability, and
efficiency. Since the main functionality of the screening system is its “discriminating
power”, we aim to establish the system’s capability in maximising the
chance of detecting those in the community who suffer from an eye disease at 
an early stage, while minimising the number of “false positives” - those who 
failed the test, but have no eye disease. We have used the Receiver Operator 
Characteristic (ROC) analysis [7] and associated methods for this purpose. 

In assessing reliability, we are interested in seeing how reliably the data 
collected by the system reflect a subject’s visual functions. To this end, we de- 
vised two evaluation strategies. The first is based on the notion of reproducibility 
of test results: the results should be reproducible for the same subjects if the 
tests are carried out in close time proximity in the same testing environments. 
The second strategy is based on checking the agreement of disease patterns be- 
tween this and other conventional visual field tests. Since many conventional 
visual-field testing instruments have been clinically demonstrated to be reliable, 
it would be useful to check if results produced by our system are consistent or 
not with those from established testing instruments. 

Regarding the efficiency assessment, we investigated two different ways
of finding out the minimum number of repeated measurements for an individual 
test while maintaining the quality of the test results. Further, since one of the 
major features of the system is the capability of judging whether the current test 
applied to a subject is stable, and whether an immediate follow-up test should 
be conducted, how efficient is this aspect of the screening system? 

4.1 Functionality 

To examine the discriminating power, the main functionality of the test,
ROC analysis was considered the most direct method. ROC curves are drawn
to assess a test’s diagnostic performance by displaying pairs of sensitivity and
specificity values throughout the whole range of a test’s measurements (see Figure
1). As curves shift towards the upper left of the diagram, the performance
of the test improves in terms of both sensitivity and specificity. The decision
threshold used for discriminating between normal and abnormal subjects is the 
average percentage of positive responses within a test | 2 | . 
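
The construction of such a curve can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the function name `roc_points`, the toy data, and the convention that an eye fails the test when its average percentage of positive responses is at or below the threshold are all assumptions made for the example.

```python
def roc_points(scores, diseased, thresholds):
    """Return (threshold, sensitivity, specificity) for each threshold.

    scores   -- average percentage of positive responses per eye (0-100)
    diseased -- parallel list of booleans from the clinical assessment
    Assumption: an eye fails the test when its score is <= threshold.
    """
    n_pos = sum(diseased)
    n_neg = len(diseased) - n_pos
    points = []
    for t in thresholds:
        true_pos = sum(1 for s, d in zip(scores, diseased) if d and s <= t)
        true_neg = sum(1 for s, d in zip(scores, diseased) if not d and s > t)
        points.append((t, true_pos / n_pos, true_neg / n_neg))
    return points


# Toy data, invented for illustration only.
scores = [35, 50, 62, 70, 80, 88, 93, 97]
diseased = [True, True, True, False, True, False, False, False]
for t, sens, spec in roc_points(scores, diseased, [40, 65, 85]):
    print(f"threshold={t:3d}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

Sweeping the threshold over the whole range of scores and plotting the resulting pairs gives one curve per test.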

We applied ROC analysis to both the original CCVP data and the data 
collected by the integrated screening system. Amongst the 3182 test records 
collected in Nigeria, we have 181 different eyes which have been assessed to have 
optic neuritis and 352 who have no such disease. These findings were made by 
ophthalmologists who examined other signs, symptoms or test results, which are 
considered to be independent of the CCVP test. Figure 1 gives the sensitivity 
and specificity results from those 533 different eyes obtained under both the 
original CCVP and the integrated screening system. The results of the original 
CCVP with regard to the disease are represented by the solid curve, while those of

Fig. 1. Sensitivity and specificity from the two systems

the integrated system are given by the dashed line. Although there are areas where
the CCVP test has shown better sensitivity (bounded between 20% and 55% 
sensitivity, and between 90% and 100% specificity), the integrated screening 
test has higher sensitivity and specificity than the CCVP test for all other areas, 
particularly in the top left corner. This area is the most important since we are 
trying to find out which test can maximise both sensitivity and specificity for 
detecting the disease. 

The ROC analysis was also applied to the glaucoma data collected from the 
opportunistic glaucoma study in London and overall results obtained are similar 
to those from the optic neuritis data. Among all the 925 people screened during 
the three-month period, 33 failed the test. All these 33 people, together with 45 
chosen from those who passed the test (controls), were later assessed clinically 
in the practice by an ophthalmologist. The group who failed the test had many 
more eye problems than the group who passed the test. For example, 70% of 
those who passed the test (controls) had a normal visual field, and there was not 
a single confirmed glaucoma case found in this group. On the other hand, 82% of 
the people who failed the test had various visual defects including 34% confirmed 
glaucoma cases, 9% glaucoma suspects, 24% cataract cases and 15% other visual 
defects. It is encouraging to note that, among those whose visual defects were 
first detected by our screening system, most did not consider themselves to have 
any eye problem when they visited the clinic. 

4.2 Reliability 

Reproducibility of Test Results. As glaucoma is a long term progressing 
disease, the visual function should remain more or less the same during a short 
period of time. Therefore results from two such repeated tests within this time 
Fig. 2. Results of the original CCVP system




Fig. 3. Results of the integrated screening system 



period should be very close. However, this is not always true under real clinical 
situations as measurement noise is involved in each test, perhaps for different 
reasons. Thus it is not surprising to note that there are a large number of re- 
peated tests, which were conducted within an average time span of one month, 
and whose results showed disagreements of varying degrees (see the CCVP re- 
sults in Figure 2). In the figure the dot is used to indicate the result of the first 
test, the oval is used for the result of the second test, and the difference between 
the two results for each case is illustrated by the line in between them. There are 
a number of repeated measurements for each test, and the “average sensitivity” 
is the average of the percentages of positive responses within all those repeated 
measurements. 
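
The quantities plotted in the figure can be computed as sketched below. The function names and the sample values are assumptions made for illustration; the only element taken from the text is the definition of the average sensitivity as the mean percentage of positive responses over the repeated measurements of one test.

```python
def average_sensitivity(measurements):
    """Mean percentage of positive responses over the repeated measurements."""
    return sum(measurements) / len(measurements)


def disagreement(first_test, second_test):
    """Absolute difference between the average sensitivities of two tests.

    This corresponds to the length of the line joining the two results
    of one subject in the figure.
    """
    return abs(average_sensitivity(first_test) - average_sensitivity(second_test))


first = [80.0, 75.0, 85.0, 80.0]    # invented values
second = [60.0, 70.0, 65.0, 65.0]   # invented values
print(disagreement(first, second))  # 15.0
```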

Since one of the main reasons for the disagreement is the measurement noise, 
it is natural to expect that the sensitivity results of two consecutive tests should 
agree (to varying degrees) after the noise is discarded. As the integrated system 
has the data cleaning capability, we can then see whether the results of repeated 
tests are consistent or not. This is indeed the case: the lines between two
consecutive tests are in general shorter in Figure 3, which shows the results
of experimenting with nearly 100 pairs of test records that had significant
disagreements between two consecutive tests for the CCVP test (Figure 2).



Pattern Agreement. Another way of assessing the reliability of the screening 
system is to analyse how different eye diseases may manifest themselves on the 
test data. A reasonable assumption is that the visual damage patterns detected 
should be similar to those from conventional testing instruments, or from clinical 
topographic analysis. It is understood that glaucomatous damage tends to occur
first in one “hemifield”, which is either the upper or lower half of the retina, or the
corresponding area of the visual field. In other words, abnormal locations are 
more likely to be found within a single hemifield, according to findings from 
conventional testing instruments. However, this pattern is not found for optic 
neuritis studies. 

So what patterns can be detected from the data collected by our screening 
system? We took a large sample of clinical test records and analysed them using 
an interactive data exploration procedure in which the data analyst steers the 
discovery process. The analysis was an interactive and iterative process where
the analyst needed to make a number of decisions, e.g. whether more specific data
selection should be made or alternative action be taken, whether any patterns
detected had any significant meanings, and whether they could be validated. We
have experimented with over three thousand clinical test records from patients
with visual field loss from glaucoma and optic neuritis.

Our findings regarding glaucomatous data are indeed consistent with those of 
early research on conventional visual field test methods. The correlations among 
locations within either hemifield were found to be strong, whereas correlations 
between any two locations across hemifields were much weaker. For optic
neuritis data we have also found strong correlation between certain pairs of
retinal locations, but in this case across the two hemifields [2]. Topographic analysis
of clinical chorioretinal changes related to the sensitivity at these test locations 
was conducted for optic-neuritis subjects, which has confirmed this finding. 
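
A comparison of this kind can be sketched as follows. This is not the interactive exploration procedure used in the study; it only illustrates how within-hemifield and cross-hemifield correlations might be summarised. The location names, the hemifield assignment and the sensitivity values are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def mean_correlations(records, hemifield):
    """Average pairwise correlation within and across hemifields.

    records   -- dict: location name -> list of sensitivities, one per subject
    hemifield -- dict: location name -> "upper" or "lower"
    """
    within, across = [], []
    for a, b in combinations(sorted(records), 2):
        r = pearson(records[a], records[b])
        (within if hemifield[a] == hemifield[b] else across).append(r)
    return sum(within) / len(within), sum(across) / len(across)


# Invented data: four test locations, five subjects.
records = {
    "u1": [90, 70, 50, 80, 60],
    "u2": [88, 72, 48, 83, 58],
    "l1": [60, 85, 75, 70, 90],
    "l2": [62, 83, 78, 68, 91],
}
hemifield = {"u1": "upper", "u2": "upper", "l1": "lower", "l2": "lower"}
within, across = mean_correlations(records, hemifield)
print(f"within={within:.2f}  across={across:.2f}")
```

In this toy data the within-hemifield average is much higher than the cross-hemifield one, which is the pattern reported above for glaucomatous data.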



4.3 Efficiency 

To evaluate the efficiency of our system in terms of testing time versus reliability, 
we tried to find out the minimum number of repeated measurements for an 
individual test. Two strategies have been suggested. The first was based on 
the idea of prediction/classification and was tested on a set of visual field data 
involving six test locations, four test stimuli and ten repeated measurements. 
Various techniques such as neural networks, multiple regression and decision tree 
induction were used to do the prediction. The other strategy was based on the 
idea of clustering and was applied to a set of clinical glaucomatous test records. 
These experiments found both strategies generally capable of keeping the
number of repeated measurements low for individual subjects while maintaining
the quality of the test results. However, further research is required for a
thorough and careful comparative study of these two strategies, as they have
different underlying assumptions as well as different theoretical foundations.

To evaluate the efficiency of on-line follow-up tests for a single test visit,
a strategy was proposed and applied to the data from the optic-neuritis study
in Nigeria. In particular, follow-up tests were carried out irregularly on 532
different eyes. We found that 371 (70%) subjects were stable and the remaining
161 subjects were unstable in their first tests. Among those unstable subjects,
95 became stable in their second test. Of the remaining 66 subjects, 51, though
indicated as unstable, had patterns similar to those from the first test.
Only 2.8% of subjects were left undecided. These observations indicate that
one test is sufficient to obtain stable test results for 70% of subjects, while one
immediate follow-up test is normally adequate for the rest of the population.
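
The percentages quoted above follow directly from the reported counts, as the short check below shows. The variable names are ours; the numbers are those given in the text.

```python
total_eyes = 532
stable_first = 371
unstable_first = total_eyes - stable_first        # 161 unstable in first test
stable_second = 95
still_unstable = unstable_first - stable_second   # 66 unstable after second test
similar_pattern = 51
undecided = still_unstable - similar_pattern      # 15 left undecided

print(f"stable after one test: {stable_first / total_eyes:.1%}")   # 69.7%
print(f"left undecided:        {undecided / total_eyes:.1%}")      # 2.8%
```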

5 Concluding Remarks 

We have applied a software quality evaluation process to the assessment of an 
eye screening system and we have learned a lot, particularly the following: 

1. Problem analysis: we have found it very important to carefully analyse the
various factors affecting the system evaluation, including evaluation objectives,
target populations, related operating constraints, and what data should be
collected and how. Feedback from early system trials has contributed
in no small part to the continuing refinement of the screening system.

2. Data collection: Data collection from large-scale field studies is expensive 
and it is crucial to have the necessary resources and close collaboration from 
field staff. For example, we were fortunate to be supported by the British 
Council for the Prevention of Blindness and the World Health Organisation 
for the field study in Nigeria to make the data collection possible. In all, two 
Land Rovers and six portable PCs were used by dedicated staff in screening 
several thousand villagers who collaborated in the trial. 

3. Evaluation: we have adopted a systematic way of evaluating different aspects 
of the screening system. Although we only have space to present results from 
the evaluation of the three main characteristics: functionality, reliability and 
efficiency, results from other characteristics are encouraging too. For exam- 
ple, the system has been found to have a higher than expected acceptability. 
During the three-month glaucoma study the test was offered to 1215 subjects 
of whom 925 (76%) accepted (the acceptance of opportunistic tests in city 
practices ranged from 50% to 70% in general). This was achieved despite 
there being little work to increase the number of patients taking part in the 
study (no advertising, and little stimulation from staff in the clinic) . This is 
an encouraging finding regarding the “usability” of the system. 

4. Future research: this will include further usability testing of the screening 
system. Moreover, the consistency checking between the results obtained by 
subjects’ responses and subjects’ response time may offer yet another
interesting way of assessing the system’s reliability. A more thorough evaluation
using various methods will also be attempted.



470 Gongxian Cheng et al. 



Finally, this is a truly interdisciplinary project in which community health ex- 
perts, computer scientists, epidemiologists, eye specialists, general practitioners, 
and ophthalmic nurses have worked together in identifying the system require- 
ments, designing and implementing the system, testing the system in different 
operating environments, analysing the data collected, and continuously refining 
the software. The collaboration from the trial communities is also vital as this is 
directly related to the quality of data collected. Currently the test is being used 
in more public environments and the data collected will allow us to improve the 
system further. 
Abstract. This paper presents an application of Rough Sets algorithms to
the prediction of component failures in the aerospace domain. To achieve
this we first introduce a data preprocessing approach that consists of case
selection, data labeling and attribute reduction. We also introduce a
weight function to represent the importance of predictions as a function
of time before the actual failure. We then build several models using
rough set algorithms and reduce these models through a postprocessing
phase. End results for failure prediction of a specific aircraft component
are presented.

1 Introduction 

Rough Sets theory was first defined by Pawlak. During the last few years it
has been applied in Data Mining and Machine Learning environments to different
application areas. As demonstrated by these previous applications and by its
formalized mathematical support, Rough Sets are efficient and useful tools in
the field of knowledge discovery for generating discriminant and characteristic rules.
However, in some cases the use of this technique and its algorithms requires some 
preprocessing of the data. In this paper, we explain the application of the Rough 
Sets algorithms and the preprocessing involved in order to use these techniques 
for prediction of component failures in the aerospace domain. 

In today’s aerospace industry the operation and maintenance of complex
systems, such as commercial aircraft, is a major challenge. There is a strong desire
to monitor the entire system of the aircraft and predict when there is a potential
for certain components to fail. This is especially true now that modern aircraft
have complex sensors and on-board computers that collect huge amounts of data
at different stages of operation of the aircraft and transmit this data to a
ground control center, where it is available in real time. This information
usually consists of both text and parametric (numeric/symbolic) data and it
exceeds 2-3 megabytes of data per month for each modern aircraft. In most
cases this data may not be used or even properly warehoused for future access.
Several reasons exist: (i) engineers and operators do not have sufficient time
to analyze huge amounts of data, unless there is an urgent requirement, (ii)
the complexity of the data analysis process is in most cases beyond the ordinary
tools that they have, and (iii) there is no well-defined automated mechanism to
extract, preprocess and analyze the data and summarize the results so that the
engineers and technicians can use them.

Several benefits could be obtained from proper prediction of component fail- 
ures. These are: (i) reducing the number of delays, (ii) reducing the overall 
maintenance costs, (iii) potential increase in safety, and (iv) preventing addi- 
tional damage to other components. 

The data used in this research comes from automatically acquired sensor 
measurements of the auxiliary power units (APU) of 34 Airbus A320 aircraft. 
This data was acquired between 1994 and 1997 and consists of two major parts:
(i) all repair actions taken on these aircraft, and (ii) all parametric data acquired
during the operation of these power units. Examples of problems with this data
were: missing attributes, out-of-range attributes and improper data types.
After cleaning the original data, a data set consisting of about 42000 cases was
prepared.

Our goal was to use this data to generate models (in the form of rules) 
that explain failure of certain components. These rules would then be used in 
a different system in order to monitor the data and generate alerts and inform 
the user when there is a potential for certain components to fail. This paper 
explains the process and the results of our research for the use of Rough Sets 
in prediction of component failures. In Section 2 we provide an overview of the 
approach. Section 3 includes the data preprocessing procedure and in Section 4 
we explain the process of building a model. Section 5 contains the results and
Section 6 presents conclusions and future work.

2 Overview of the Approach 

The aim of the rule extraction process described in this paper is to generate 
a valid set of prediction rules for aircraft component failures. These rules will 
have to accurately recognize particular patterns in the data that indicate an 
upcoming failure of a component. 

The rule inference process starts with the selection of the data related to the
component of interest. This is done in two steps. First, we retrieve, from the
historical maintenance reports, the information about all occurrences of failure
of the given component. The information retained is the failure dates along with
the identifiers of the aircraft (or engine) on which the failures happened. Then we
use this information to retrieve all the sensor measurements observed during the
days (or weeks) preceding each failure event. We also keep some data obtained
during the days following the replacement of the component. Two new attributes
are added to the initial raw measurements: the time between the moment the
observation is collected and the actual failure event, and a tag linking each
observation to a specific failure case. The data from all failures are finally
combined to create the dataset used to build the predictive model.

In order to use a supervised learning approach such as Rough Sets algorithms
(as well as many others), we must add another attribute to the dataset
just created: the CLASS (or LABEL) attribute. The algorithm used to
generate this new attribute is called the labeling algorithm.

In our case, the labeling algorithm creates a new attribute with two different 
values (0 and 1). This new attribute is set to 1 for all cases obtained between 
the time of the failure and the preceding n days (these n days define the win- 
dow that we target for the failure predictions), and set to 0 for all other cases 
observed outside that period of time. Following the labeling of the data, some
data preprocessing is performed, as explained in Section 3.
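
A minimal sketch of such a labeling step is given below. The record layout, the key name `days_to_failure` and the sign convention (positive before the failure, negative after the replacement) are assumptions made for the example, not details taken from the system described here.

```python
def label_cases(observations, window_days):
    """Add a CLASS attribute to each observation.

    Each observation is a dict with a hypothetical key 'days_to_failure':
    the time between the observation and the failure event, positive before
    the failure and negative after the component was replaced.
    """
    labeled = []
    for obs in observations:
        in_window = 0 <= obs["days_to_failure"] <= window_days
        labeled.append({**obs, "CLASS": 1 if in_window else 0})
    return labeled


observations = [
    {"case": "F1", "days_to_failure": 45},
    {"case": "F1", "days_to_failure": 12},
    {"case": "F1", "days_to_failure": 0},
    {"case": "F1", "days_to_failure": -3},   # observed after replacement
]
print([o["CLASS"] for o in label_cases(observations, window_days=30)])
# -> [0, 1, 1, 0]
```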

The next step is to build the models. This includes: selection of the relevant
attributes, execution of Rough Sets algorithms, and post-processing of the
results. Finally, the end results are evaluated. The overall process is summarized
in Figure 1.

Fig. 1. General rule extraction procedure. (Diagram label: “Repeat N times”.)



3 Data Preprocessing 

This section explains preprocessing steps required before the application of the 
Rough Sets algorithms. 



3.1 Discretization Algorithm 

One of the requirements of all standard Rough Sets algorithms is that the
attributes in the input data table be discrete (also known as nominal
attributes). However, in the aerospace domain, sensor data usually consists
of continuous attributes, and therefore a discretization process is required.

Discretization algorithms can be classified by two different criteria. The
first distinction is between local and global algorithms. Local algorithms
are part of an induction algorithm (like C4.5): they perform partitions that
are applied in some iterations of the induction process, such as at a number
of nodes during tree construction. Global algorithms transform continuous
attributes into nominal attributes in a preliminary preparation task, with no
direct interaction with the subsequent analysis processes. The second
classification distinguishes supervised from unsupervised methods. Supervised
algorithms use label (or class) information to guide the discretization
process, whereas unsupervised methods apply other kinds of discretization
criteria (such as equal interval width or equal frequency intervals).
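
The two unsupervised criteria mentioned above can be illustrated with a short global discretization sketch. The function names and the sample readings are invented for the example.

```python
def equal_width_cuts(values, k):
    """Cut points splitting [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(1, k)]


def equal_frequency_cuts(values, k):
    """Cut points placing roughly len(values) / k values in each interval."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def discretize(value, cuts):
    """Index of the interval containing value (0 .. len(cuts))."""
    return sum(1 for c in cuts if value >= c)


readings = [1, 2, 3, 4, 5, 6, 7, 8, 100]   # invented sensor values
print(equal_width_cuts(readings, 3))        # [34.0, 67.0]
print(equal_frequency_cuts(readings, 3))    # [4, 7]
```

With the outlier at 100, equal width places eight of the nine readings in the first interval, whereas equal frequency spreads them evenly. Neither criterion looks at the class label.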

In our experiments, we have discarded local methods because: (1) global
algorithms are less prone to variance in estimation from small data size (some
experiments with C4.5 have been improved by using a preliminary global
discretization before C4.5 induction, with no local discretization) and (2) our
rule extraction process is performed by Rough Sets algorithms, which require
prior discretization. We have chosen supervised techniques because, by using
classification information, we can reduce the probability of grouping different
classes in the same interval. Some typical global supervised algorithms are:
ChiMerge and StatDisc (both of which use statistical operators as part of the
discretization function), D-2 (entropy-based discretization), and MCC (which
finds partition boundaries using contrast functions). We have chosen InfoMerge,
an information-theoretic algorithm that replaces the ChiMerge/StatDisc
statistical measures with an information loss function in a bottom-up iterative
process. This approach is similar to the C4.5 local discretization process, but
in order to apply it as a global algorithm a correction factor needs to be used.
This factor adjusts the information function using the interval weight (number
of elements).
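
The general idea of a bottom-up, information-based merge with a size weighting can be sketched as follows. This is a simplified illustration and should not be read as the InfoMerge algorithm itself: the cost function (size-weighted entropy increase), the stopping rule (a maximum number of intervals) and the data are assumptions made for the example.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def merge_cost(left, right):
    """Size-weighted entropy increase caused by merging two intervals."""
    merged = [a + b for a, b in zip(left, right)]
    return (sum(merged) * entropy(merged)
            - sum(left) * entropy(left)
            - sum(right) * entropy(right))


def bottom_up_merge(intervals, max_intervals):
    """Repeatedly merge the adjacent pair with the smallest cost.

    intervals -- list of (upper_bound, [count_class0, count_class1]),
                 ordered by upper_bound
    """
    intervals = [(ub, list(c)) for ub, c in intervals]
    while len(intervals) > max_intervals:
        costs = [merge_cost(intervals[i][1], intervals[i + 1][1])
                 for i in range(len(intervals) - 1)]
        i = costs.index(min(costs))
        ub = intervals[i + 1][0]
        counts = [a + b for a, b in zip(intervals[i][1], intervals[i + 1][1])]
        intervals[i:i + 2] = [(ub, counts)]
    return intervals


# Invented class counts per initial interval: (upper bound, [class 0, class 1]).
start = [(10, [8, 0]), (20, [6, 0]), (30, [1, 5]), (40, [0, 7])]
print(bottom_up_merge(start, max_intervals=2))
# -> [(20, [14, 0]), (40, [1, 12])]
```

The two pure class-0 intervals are merged first at zero cost, and the boundary between the mostly negative and mostly positive regions is the one that survives.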

3.2 Weight Function 

The second transformation operation is not so closely related to algorithm
requirements; its application is motivated by better rule quality at the end
of the process. As described in Section 2, the labeling mechanism selects all the
records in the last 30 days before the failure as positive data (the rules generated
by the model will discriminate this time window from the data before and after
this period). But the importance of detecting this situation is not the same
throughout this period. For example, a component failure alert 20 days before the
possible failure is less important than one 5 days before, and alerts too close
to the failure do not allow any corrective action. This domain characteristic can
be described as a weight function, as shown in Figure 2. This weight function
example defines three different values connected by a step function and it is
an example of the distribution of the importance of alerts for this component.
All algorithms of the procedure have been revised in order to use this weight
function.
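
A step-shaped weight function of this kind might look as follows. The break points, the weight values and the weight given to records outside the prediction window (`default`) are hypothetical; the text only states that three values are connected by a step function.

```python
def alert_weight(days_before_failure,
                 steps=((2, 0.2), (10, 1.0), (30, 0.5)),
                 default=1.0):
    """Importance of a record observed days_before_failure days early.

    steps   -- hypothetical (upper_limit_in_days, weight) pairs, in
               increasing order of upper limit
    default -- weight used outside the prediction window (an assumption;
               the source does not state how such records are weighted)
    """
    if days_before_failure >= 0:
        for upper_limit, weight in steps:
            if days_before_failure <= upper_limit:
                return weight
    return default


for days in (1, 5, 20, 40):
    print(days, alert_weight(days))
```

With these illustrative values an alert 5 days before the failure weighs more than one 20 days before, and an alert 1 day before weighs least because it leaves no time for corrective action.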

4 Building a Model 

In this section, the three main steps of the model building phase are described 
in detail. These steps are: i) attribute reduction, ii) rules extraction, and iii) 
Fig. 2. Weight function example (horizontal axis: time before failure, in days or number of uses).



rules post-processing. In this research, Rough Sets algorithms have been used to 
implement each of these phases. 

4.1 Attribute Reduction 

In this phase of the process, we select from an original set of attributes, provided 
by the user, a subset of characteristics to use in the rest of the process. The 
selection criteria are based on the reduct concept, as defined by Pawlak. The
term REDUCT is defined as “the essential part of knowledge, which suffices to
define all basic concepts occurring in the considered knowledge”. In the context
of this problem, we can define a reduct as a reduced set of features that is
able to predict the component failure.

Many different algorithms have been developed in order to obtain this reduced
set of attributes. Not all of them are suitable for our domain. For
instance, the Discernibility Matrix algorithm defines a triangular matrix
with a size equal to the number of records in both dimensions. This algorithm
would not be appropriate due to the size of the matrix it requires (e.g. for a
problem of 20000 records it is necessary to handle a matrix of about 200 million
cells). Another traditional method to calculate this set is to generate all
combinations of attributes and then evaluate the classification power of each
combination. The usual way to perform this evaluation is to calculate the Lower
approximation. The Lower approximation is the set of original records that
belong to the concept and are selected by an equivalence relation described by
some attributes. These attributes are used to define this Lower region. If an
element belongs to this approximation then it surely belongs to the class (the
set of records we want to classify).

\begin{align}
U &: \text{universe (all the records)} \tag{1}\\
X &: \text{elements that belong to the CLASS (concept)}, \quad X \subseteq U \tag{2}\\
R &: \text{equivalence relation (defined by the attributes)} \tag{3}\\
\mathrm{Lower} &= \{\, x \in U \mid [x]_R \subseteq X \,\} \tag{4}
\end{align}
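
The definition of the Lower region can be made concrete with a small sketch. The attribute names (`egt`, `oil`), their values and the records are invented; the function simply groups records into equivalence classes and keeps the classes that lie entirely inside the concept.

```python
from collections import defaultdict


def lower_approximation(records, attributes, is_positive):
    """Records whose whole equivalence class lies inside the concept X.

    records     -- list of dicts (the universe U)
    attributes  -- attribute names defining the equivalence relation R
    is_positive -- predicate telling whether a record belongs to X
    """
    classes = defaultdict(list)
    for rec in records:
        key = tuple(rec[a] for a in attributes)   # identifies [x]_R
        classes[key].append(rec)
    lower = []
    for members in classes.values():
        if all(is_positive(m) for m in members):  # [x]_R is a subset of X
            lower.extend(members)
    return lower


# Invented, already discretized records.
records = [
    {"egt": "high", "oil": "low", "CLASS": 1},
    {"egt": "high", "oil": "low", "CLASS": 1},
    {"egt": "high", "oil": "ok", "CLASS": 1},
    {"egt": "high", "oil": "ok", "CLASS": 0},
    {"egt": "low", "oil": "ok", "CLASS": 0},
]
lower = lower_approximation(records, ["egt", "oil"], lambda r: r["CLASS"] == 1)
print(len(lower))   # 2: only the (high, low) class is entirely positive
```

Using `egt` alone yields an empty Lower region in this toy data, because the single class with positive records also contains a negative one.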
In our experiments, we have used a simple reduct calculation algorithm. The
main goal was not to obtain the minimal attribute reduct, but to provide a good
result at a reasonable cost in terms of computation time and memory used. The
algorithm implemented also uses the Lower approximation calculation to
evaluate the classification power of a set of attributes in each of the iterations.
This approximation represents the set of data records successfully classified by
a set of attributes. Therefore, the set of attributes is chosen to preserve this
original Lower region. The algorithm pseudocode is shown in Figure 3.



    AttributeSet Calculate_Reduct(Data data, AttrSet attr)
    {
        AttributeSet red = {};
        Float acc, maxAcc = 0.0, attrAcc[attr.size()];
        Attribute at1, at2, a, b;

        while (maxAcc < REQUIRED_ACCURACY) {
            maxAcc = 0.0;
            for (a in attr) {
                attrAcc[a] = Lower_Approximation(data, red + {a});
                for (b in attr) {
                    acc = Lower_Approximation(data, red + {a, b});
                    if (acc >= maxAcc) {
                        maxAcc = acc;
                        at1 = a;
                        at2 = b;
                    }
                }
                attr = attr - {a};
            }
            if (attrAcc[at1] > attrAcc[at2])
                red = red + {at1};
            else
                red = red + {at2};
        }
        return (red);
    }

    Float Lower_Approximation(Data data, AttributeSet attr)
    Pre: "'data' must be sorted by 'attr'"
    {
        Float pos = 0.0, neg = 0.0, cls = 0.0, tot = 0.0;
        Tuple reference, current;

        reference = data.first();
        for (current in data) {
            if (IsEqual(current, reference, attr)) {
                if (IsPositive(current))
                    pos += current.weight;
                else
                    neg += current.weight;
            } else {
                tot += pos;
                if (pos / (pos + neg) > VPRSM_THRESHOLD) {
                    cls += pos;
                    Write_Rule(reference, pos, pos + neg);
                }
                reference = current;
                if (IsPositive(current)) {
                    pos = current.weight; neg = 0.0;
                } else {
                    neg = current.weight; pos = 0.0;
                }
            }
        }
        return (cls / tot);
    }



Fig. 3. Non-optimal reduct calculation / Lower approximation calculation algo- 
rithms. 



In each iteration, this algorithm first selects the best subset of two attributes 
based on the classification power (calculated with Lower_Approximation). It 
then selects the best attribute from these two. This algorithm is very efficient 
since it limits the search for the best subset of two attributes only. However, 
that limitation may also have an impact on the results obtained. It might be 
appropriate to run a modified version of this algorithm that can also search for 
the best subset of 3 attributes, or even more. 
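
The pair-then-single selection described above can be sketched in runnable form as follows. This is an illustrative re-expression, not a transcription of Figure 3, and it differs from the pseudocode in several labeled ways: records are grouped with a dictionary instead of being scanned in sorted order, the last group is always included, and the loop stops on the accuracy of the current reduct rather than on the accuracy of the best pair. The record keys `CLASS` and `weight`, the function names and the data are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def lower_accuracy(records, attrs, threshold):
    """Weighted share of positive records falling in the VPRSM Lower region."""
    pos, neg = defaultdict(float), defaultdict(float)
    for rec in records:
        key = tuple(rec[a] for a in attrs)
        (pos if rec["CLASS"] == 1 else neg)[key] += rec["weight"]
    total = sum(pos.values())
    covered = sum(p for key, p in pos.items()
                  if p / (p + neg.get(key, 0.0)) > threshold)
    return covered / total if total else 0.0


def greedy_reduct(records, attributes, required_accuracy, threshold):
    """Add one attribute per iteration, chosen from the best pair."""
    reduct, remaining = [], list(attributes)

    def score(extra):
        return lower_accuracy(records, reduct + list(extra), threshold)

    accuracy = 0.0
    while accuracy < required_accuracy and remaining:
        if len(remaining) == 1:
            best = remaining[0]
        else:
            # Best pair first, then the better single attribute of that pair.
            pair = max(combinations(remaining, 2), key=score)
            best = max(pair, key=lambda a: score([a]))
        reduct.append(best)
        remaining.remove(best)
        accuracy = score([])
    return reduct


# Invented records with positive weights: attribute "a" alone separates the classes.
records = [
    {"a": 0, "b": 0, "c": 0, "CLASS": 1, "weight": 1.0},
    {"a": 0, "b": 1, "c": 1, "CLASS": 1, "weight": 1.0},
    {"a": 1, "b": 0, "c": 1, "CLASS": 0, "weight": 1.0},
    {"a": 1, "b": 1, "c": 0, "CLASS": 0, "weight": 1.0},
]
print(greedy_reduct(records, ["a", "b", "c"], 1.0, 0.95))   # ['a']
```

Each iteration evaluates every pair of remaining attributes once, which is what keeps the number of Lower approximation calculations far below that of an exhaustive search over all attribute subsets.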

Figure 4 compares the combinatorial calculation of the reduct with the
calculation using our approximate algorithm. The figure shows the number of
times the Lower_Approximation function has to be executed. For example, to
calculate a 5-attribute reduct from 80 original attributes, the combinatorial
approach requires over 30 million Lower regions to be calculated, whereas the
other algorithm requires only 13450.

Fig. 4. Calculation of a 5-attribute reduct



4.2 Rule Extraction 

At the core of the model-building process we find the rule extraction step. The
algorithm that performs this step scans the training data and extracts discriminant
rules for the target concept using the selected subset of attributes (obtained
from the attribute reduction algorithm, see Section 4.1). In our experiments, we
have selected a fixed number of attributes for the reduct computation (the most 
discriminant ones, according to the reduct criteria). In other words, we forced 
the rule extraction algorithms to work with only a small subset of features. This 
constraint was necessary to limit the size of the rules generated and helped in 
keeping a good level of comprehensibility for domain experts that will have to 
review the results. 

In our experiments, we also used Lower approximation calculation to gener- 
ate the rules that describe the concept (i.e. the situations for which we should 
predict a specific component failure). Using this approach, each rule obtained 
consists of a conjunction of attribute value conditions (one condition per input 
attribute). As we will see in Section 4.3, this set of rules had to be processed
before being used to predict component failure.

The implementation developed in our research supports the Variable Precision
Rough Set Model (VPRSM, as defined by Ziarko), and the algorithm used is based
on a previously proposed design. VPRSM extends traditional rough sets theory
by providing an inclusion threshold that allows more flexibility. With VPRSM an
element x belongs to the Lower region if more than a% of the elements in the same
equivalence class ([x]_R) belong to the concept. The only variation of this
algorithm is related to the use of the weight function and its effect on the
threshold comparison process in VPRSM (see Figure 3).

4.3 Rule Postprocessing 

The number of rules obtained from the rule extraction process described above is 
typically very high. This section first explains why so many rules are generated 
and then, it explains an approach developed to transform the rule set obtained 
into a smaller one. 

First, one of the characteristics of rules extracted by the Lower approxima- 
tion calculation is that all the rules are expressed in terms of all the attributes 
provided to the algorithm. Each rule extracted using this technique is a conjunc- 
tion of predicates. The format of these predicates is attribute = value, and all 
the attributes appear in all the rules. Clearly, with such a representation, the 
number of rules required to cover all possibilities is very large. 

The quality of the discretization process may also have an impact on the total
number of rules generated. Because the discretization process is independent
of the rule extraction algorithm used, an attribute may be split into more
intervals than required to generate the rules. In these cases, two or more rules
are generated that differ only in the value of a discretized attribute, and these
two or more values represent consecutive intervals. Such non-optimal splitting of
the attributes increases the number of rules obtained.

In order to reduce the number of rules, a two-phase algorithm has been
developed. In the first phase all the initial rules are combined to generate new
rules. These new rules are more general than the previous ones (they include all
the elements described by both of the combined rules). This process is repeated
until no new rule can be generated. In each of the iterations any initial or
previously generated rules can be combined. In the second phase, all the rules
that are described by a more general rule (all of the elements represented by the
rule are also represented by another rule) are removed. The result of this second
phase is a final set of rules equivalent to the original one but smaller (or in
the worst case equal). This process cannot be achieved by a single
combination/pruning phase since some rules may be used to generate more than one
new rule. An example of execution of this algorithm is shown in Figure 5.

The final output of this algorithm is a smaller set of more general,
postprocessed rules. These rules are finally sorted by their support, defined as
the ratio between the number of cases to which the rule can be applied
and the total number of cases.
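
The two-phase procedure and the final ordering by support can be sketched as follows. The rule representation is an assumption made for the example: a rule is a tuple with one (lowest, highest) range of interval indices per attribute, and two rules are combined when they are identical except for one attribute whose ranges are adjacent or overlapping. The rules and cases are invented.

```python
def combine(r1, r2):
    """Merge two rules that differ in exactly one attribute whose ranges are
    adjacent or overlapping; return None if they cannot be merged."""
    diff = [i for i, (x, y) in enumerate(zip(r1, r2)) if x != y]
    if len(diff) != 1:
        return None
    i = diff[0]
    (lo1, hi1), (lo2, hi2) = r1[i], r2[i]
    if lo2 > hi1 + 1 or lo1 > hi2 + 1:
        return None
    return r1[:i] + ((min(lo1, lo2), max(hi1, hi2)),) + r1[i + 1:]


def covers(general, specific):
    """True if every case matched by specific is also matched by general."""
    return all(g[0] <= s[0] and s[1] <= g[1] for g, s in zip(general, specific))


def postprocess(rules):
    rules = set(rules)
    changed = True
    while changed:                       # phase 1: combine until closure
        changed = False
        for a in list(rules):
            for b in list(rules):
                merged = combine(a, b)
                if merged is not None and merged not in rules:
                    rules.add(merged)
                    changed = True
    # Phase 2: drop every rule covered by a different, more general rule.
    return sorted(r for r in rules
                  if not any(o != r and covers(o, r) for o in rules))


def support(rule, cases):
    """Fraction of cases (tuples of interval indices) matched by the rule."""
    hits = sum(all(lo <= v <= hi for (lo, hi), v in zip(rule, case))
               for case in cases)
    return hits / len(cases)


# Two attributes (A, B) with three intervals each, indexed 0..2.
rules = [((0, 0), (0, 0)), ((0, 0), (1, 1)), ((0, 0), (2, 2)),
         ((1, 1), (1, 1)), ((2, 2), (1, 1))]
final = postprocess(rules)
print(final)   # [((0, 0), (0, 2)), ((0, 2), (1, 1))]

cases = [(0, 0), (0, 2), (1, 1), (2, 1), (0, 1), (2, 2)]
ranked = sorted(final, key=lambda r: support(r, cases), reverse=True)
```

Five initial rules collapse into two: one covering the first interval of A for any value of B, and one covering the middle interval of B for any value of A. The initial rule with A in interval 0 and B in interval 1 contributes to both, which is why combination and pruning are kept as separate phases.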

5 Performance and Results 

In this section, we report the results obtained by our approach to learn models 
to predict failure of the Auxiliary Power Unit (APU) starter motor. We also 
study the relationship between two important parameters of the approach. The 
process for our experiment is as follows: 
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B 

[Figure: rules plotted over attribute A (boundaries a1, a2, a3) and attribute B 
(boundaries b1, b2, b3), with three rule lists labelled Original Rules, 
Generated Rules, and Final Rules.] 

Fig. 5. Rule postprocessing example. 



1. The data is split into batches, one batch for each failure case. For the 
APU starter problem, we had data from 30 failure cases (so 30 batches were 
created). 

2. We execute our approach to learn the rules using data from 29 cases and 
then use the data from the remaining case for validation. We repeat this 
step until data from each case has been used for validation (which means 30 
iterations for the current component). 

3. We use the validation results from the different runs to compute: (i) the 
number of cases for which we have at least one good alert generated during 
the prediction window (defined in an earlier section), and (ii) the number of 
cases for which we have one or more alerts generated outside the prediction 
window. In Table 1 these two numbers are referred to as Good Alert and 
False Alert, respectively. 
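
The three steps above can be sketched in Python as follows. The sketch is only an outline under stated assumptions: learn, alerts_for and in_window are hypothetical callables standing in for the rule learner, the monitoring tool and the prediction-window test, and the 15-day window and the toy batches are invented for the example.

```python
def leave_one_batch_out(batches, learn, alerts_for, in_window):
    """Return (good_alert_rate, false_alert_rate) over all held-out batches."""
    good = false = 0
    for i, held_out in enumerate(batches):
        training = batches[:i] + batches[i + 1:]      # all cases but one
        model = learn(training)
        alerts = alerts_for(model, held_out)
        if any(in_window(held_out, t) for t in alerts):
            good += 1                                 # at least one good alert
        if any(not in_window(held_out, t) for t in alerts):
            false += 1                                # at least one false alert
    n = len(batches)
    return good / n, false / n

# Toy stand-ins (hypothetical): alerts are precomputed and the "model" is unused.
toy_batches = [
    {"failure_day": 100, "alert_days": [95]},   # alert inside the window
    {"failure_day": 80, "alert_days": [10]},    # alert far too early
    {"failure_day": 60, "alert_days": []},      # no alert at all
]
good_rate, false_rate = leave_one_batch_out(
    toy_batches,
    learn=lambda training: None,
    alerts_for=lambda model, batch: batch["alert_days"],
    in_window=lambda batch, day: batch["failure_day"] - 15 <= day < batch["failure_day"],
)
```

Because both counts are taken per case, a single case can contribute to Good Alert and to False Alert at the same time, and with 30 batches every reported percentage is a multiple of 1/30.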

We repeated the above process several times with different settings for two 
important parameters in our approach: the VPRSM threshold and the maximal 
number of intervals generated by the discretization algorithms. We experimented 
with VPRSM thresholds of .99, .97, .95, .90, and .80. Similarly, we experimented 
with values of 2, 3, 5, 7, and 10 for the maximal number of discretization intervals. 
Table 1 presents the results from our experiments. The impact of these two 
parameters on the final results is very significant. In the top left corner of the 
table, with highly restrictive thresholds and a small number of intervals, the 
percentages of correct failure predictions and false alerts are both very low. On 
the other hand, low VPRSM thresholds and a large number of intervals for 
discretization (bottom right corner of the table) lead to a high percentage of 
correct failure predictions along with a high ratio of false alerts. It is very 
interesting to note the impact of the maximal number of intervals for 
discretization. For instance, with a VPRSM threshold of .97, increasing the maximal 
number of intervals from 5 to 7 leads to an increase of 20 percentage points in the 
number of failures predicted (from 50.0% to 70.0%) and to a decrease of about 26 
percentage points in the false alert ratio (from 33.3% to 6.7%). 

Finally, the most interesting result was obtained with a threshold of .97 and a 
maximum of 7 intervals. This result shows a good ability of the model in predicting 
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failures of the APU starter motor (70%) with a reasonable percentage of false 
alerts (6.7%). 



Table 1. VPRSM threshold vs. maximum number of intervals 

                                    # Intervals 
Threshold                    2        3        5        7       10 
0.99      Good Alert:      3.3%     6.7%    20.0%    33.3%    26.7% 
          False Alert:    10.0%     6.7%    10.0%     6.7%    10.0% 
0.97      Good Alert:      3.3%    20.0%    50.0%    70.0%    23.3% 
          False Alert:    10.0%    33.3%    33.3%     6.7%    10.0% 
0.95      Good Alert:      6.7%    26.7%    40.0%    56.7%    40.0% 
          False Alert:    16.7%    23.3%    10.0%    93.3%    33.3% 
0.90      Good Alert:     10.0%    23.3%    63.3%    83.3%    86.7% 
          False Alert:    16.7%    20.0%    43.3%    66.7%    96.7% 
0.80      Good Alert:     10.0%    36.7%    70.0%    83.3%    93.3% 
          False Alert:    16.7%    30.0%    66.7%    96.7%    96.7% 




The rules extracted by our model never have more than five attributes 
(predicates). This rule size is close to the limit above which human comprehension 
becomes difficult. This characteristic is quite important because the predictive 
rules are processed by an automated monitoring tool that generates alerts with 
these rules, and for each alert the associated rule needs to be shown to 
an expert user who decides on the corrective actions to be taken. An example of a 
rule obtained is: 

IF 50.000<=SMIN15<52.000 AND 713.000<=EMIN20 AND 522.000<=EMAX 
THEN "APU starter motor will fail within 15 days" 

Similar rules can be generated by other algorithms. We are experimenting 
with other systems such as C4.5 and other algorithms accessible through MLC++. 

Results obtained so far tend to show that the approach developed in this 
paper is competitive with well known decision tree systems in both the execution 
time and the accuracy of the results. For instance, the best model obtained so far 
with C4.5 has been able to correctly predict 77% of the failures with a false alert 
rate of about 9%. In terms of execution time, our Rough Sets implementation 
and C4.5 are also quite similar; each experiment for the selected component 
takes about 25 minutes with both systems. 



6 Conclusions and Future Work 

In this paper we present a new approach to the use of Rough Sets algorithms for 
prediction of component failures. Our data came from a real world aerospace ap- 
plication for which accurate predictions of component failures will be extremely 
useful. The approach consists of an extensive data reduction process, use of a 
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global supervised algorithm for discretization and a weight function to evaluate 
the performance of our experiments. The experiments carried out in our research 
revealed that the large number of rules generated by the algorithms had to be 
reduced to a smaller set for human comprehensibility. This was done using a 
novel approach that significantly reduces the number of rules without affecting 
the accuracy of the results. 

An extensive experiment has been run to verify the impact of two param- 
eters: the VPRSM threshold and the maximal number of intervals generated 
during discretization. The experiment has shown that the quality of the results 
is heavily affected by the maximal number of discretization intervals chosen. 
The experiment has also demonstrated that the overall approach is useful for 
obtaining rules that can predict up to 70% of the APU starter motor failures 
(prediction of the component targeted in this research) with a very reasonable 
rate of false alerts (less than 7%). Models of this kind could lead to important 
savings for an airline. 

The research framework described in this paper can be used as a basis for our 
future research in this area. Different discretization algorithms, weight functions 
and attribute reduction techniques, along with other forms of rule postprocessing 
strategies, can be experimented with. 
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Abstract. In this paper, we report on a set of experiments that ex- 
plore the utility of making use of the structural information of WWW 
documents. Our working hypothesis is that it is often easier to clas- 
sify a hypertext page using information provided on pages that point to 
it instead of using information that is provided on the page itself. We 
present experimental evidence that confirms this hypothesis on a set of 
Web-pages that relate to Computer Science Departments. 



1 Introduction 

The advent of the World-Wide Web has rejuvenated the interest in text 
categorization problems. Vast amounts of documents are available on-line, and 
categorizing them into meaningful semantic categories is a rewarding and 
challenging research problem. 

However, current approaches to text categorization on the Web mostly con- 
centrate on simple representation schemes that are based on word occurrence 
and word frequency. The structural information that is inherent to documents 
on the Web is often neglected. There are at least two different kinds of struc- 
tural information on the Web that could be used to enhance the performance of 
current text classification algorithms: 

— the structure of an HTML representation, which makes it easy to identify 
important parts of a document, such as its headings and its title, and 

— the structure of the Web itself, where pages are linked to each other in various 
ways. 

In this paper, we report on a set of experiments that explore the utility of 
such structural information. Our working hypothesis is that (at least in some 
domains) it is easier to classify hypertext pages using information provided on 
pages that point to a page instead of using information that is provided on the 
page itself. There are several reasons for this: 

Redundancy: Quite often there is more than one page pointing to a single 
page on the Web. The ability to combine multiple, independent sources of 
information can improve classification accuracy. 



D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 487 ff., 1999. 
© Springer-Verlag Berlin Heidelberg 1999 
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Independent Labeling: Being able to rely on the information provided by 
multiple authors (the authors of the pages that point to the page to be 
classified) is less sensitive than having to rely on the vocabulary used by one 
particular author Cl- 

Page Sparseness: Web pages are often very sparse or contain mostly images. 
Using the links to a page increases the chances of encountering informative 
text about the page to classify. 

To investigate our hypothesis, we represent a Web page with features derived 
from information of pages that point to the page. To that end, we encode each 
hyperlink pointing to a document with its anchor text, the headings structurally 
preceding it, and the text of the paragraph in which it occurs. Then we learn a 
set of classification rules with the inductive rule learning algorithm RIPPER PJ . 
The predictions of links pointing to the same page are then combined to yield a 
prediction for this page. 

Our results show that documents can often be classified more reliably with in- 
formation originating from pages that point to the document than with features 
that are derived from the document text itself. 

2 Motivation 

Our approach for the use of structural information for classifying Web-pages 
was motivated by the following observation that we made while working with 
conventional text classification techniques on the WebKB data set.¹ 

Observation 1: The text on the pages themselves is often insufficient or 
irrelevant for a reliable classification. 

For example, home pages of computer science departments often only consist 
of images with pointers to information about offered courses, student and faculty 
home pages, research projects, etc. Even if this information is contained on a 
single page, the words on the page itself do not provide many clues for the fact 
that we are dealing with the home page of a computer science department as 
opposed to any other page in a computer science department. 

Observation 2: Information on the pages that contain a pointer to a given 
page is much more helpful. Very often, at least one of the following three pieces of 
information contains an obvious clue for the intended classification of the page. 

1. the anchor text 

2. the context in which the anchor text appears 

3. the headings that structurally precede the section of the document in which 
the link occurs 

For example, department pages typically have a large number of links point- 
ing to them that are marked with anchor texts that include phrases like “com- 
puter science department”, “CS department”, “dept, of computer science”, or 



¹ A brief description of this domain can be found in section 5. 



Exploiting Structural Information for Text Classification on the WWW 






similar. Each of them should be sufficient to identify the link as pointing to the 
page of a computer science department. Student home pages very often contain 
a pointer to their advisor’s home page. Thus, faculty home pages can often be 
identified by the occurrence of the word “advisor” in the neighborhood of a link 
that points to the page. Furthermore, many computer science departments have 
a page that lists all students, faculty, staff, projects, courses, or other informa- 
tion. Typically such a page (or segment of a page) starts with a heading that 
identifies the type of information that is listed below it. Clearly, this information 
can also be very useful for classifying the pages that the list items below this 
heading point to. 



3 Document Representation 

In order to capture its structural information, we represented a document in the 
following way. First, the entire text of the document itself was discarded. Instead, 
we identified a set of pages that contain a pointer to the current page.² Each 
of these pages was turned into a separate training example using the following 
pieces of information: 

Anchor: All words that occurred in the anchor text of the link (between the 
opening <A . . .> and the closing </A> of the HTML link). 

Heading: All words that occurred in headings that structurally precede the 
hyperlink in the HTML document. This means a heading of type <Hi> is 
included iff it appears before the hyperlink and no heading of type <Hj> 
with j ≤ i appears in the segment between the heading and the hyperlink. 
Page titles and titles for definition lists (<DT>) were also included as headings 
(with i = 0 and i = 7 respectively). 

Paragraph: All words of the paragraph in which the hyperlink occurs. Our 
method for determining the paragraph is somewhat heuristic and certainly 
not perfect. Pieces of text separated by <P> or an empty line are paragraphs, 
as are structural entities such as items in a list <LI>. 
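
The Anchor and Heading rules above can be sketched with Python's standard html.parser module. This is an illustrative reading of the rules, not the code used in the experiments: the class name LinkContext and its attributes are invented here, the Paragraph feature is left out, and the sketch assumes that heading and <DT> elements are explicitly closed.

```python
from html.parser import HTMLParser

# Heading levels as described in the text: title = 0, H1..H6 = 1..6, DT = 7.
LEVELS = {"title": 0, "h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6, "dt": 7}

class LinkContext(HTMLParser):
    """Collects, for every hyperlink, its anchor text and the headings that
    structurally precede it (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.active = []       # (level, text) of headings currently in scope
        self.links = []        # one dict per hyperlink
        self._heading = None   # (level, text parts) while inside a heading
        self._link = None      # link under construction while inside <a>

    def handle_starttag(self, tag, attrs):
        if tag in LEVELS:
            self._heading = (LEVELS[tag], [])
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._link = {"href": href, "anchor": [],
                              "headings": [text for _, text in self.active]}

    def handle_data(self, data):
        if self._heading is not None:
            self._heading[1].append(data)
        if self._link is not None:
            self._link["anchor"].append(data)

    def handle_endtag(self, tag):
        if tag in LEVELS and self._heading is not None and self._heading[0] == LEVELS[tag]:
            level, parts = self._heading
            # A new heading of level j hides every earlier heading of level i >= j.
            self.active = [(l, t) for l, t in self.active if l < level]
            self.active.append((level, " ".join("".join(parts).split())))
            self._heading = None
        elif tag == "a" and self._link is not None:
            self._link["anchor"] = " ".join("".join(self._link["anchor"]).split())
            self.links.append(self._link)
            self._link = None

page = """<html><head><title>CS Dept</title></head><body>
<h1>People</h1><h2>Faculty</h2><a href="a.html">Ann</a>
<h2>Students</h2><a href="b.html">Bob</a>
<h1>Courses</h1><a href="c.html">AI</a></body></html>"""
parser = LinkContext()
parser.feed(page)
contexts = {link["href"]: link["headings"] for link in parser.links}
```

For the toy page, the link to b.html keeps the title, "People" and "Students" but no longer "Faculty", because the second <H2> appears between that heading and the link.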

The three features described above were each encoded as a separate set-valued 
feature |2| for the separate-and-conquer rule learner Ripper P, which achieves 
noise-tolerance through an extension of incremental reduced error pruning p. 
A set-valued feature may be viewed as an efficient encoding of a group of binary 
features that correspond to the occurrences of words in the document. We will 
also refer to set-valued features as feature sets and use the terms feature and 
word interchangeably. Each training example was labelled with the appropriate 
class information, which is the class of the page the link points to. 

² We did this by scanning a collection of pages of Computer Science departments 
for all occurrences of an HREF that contains the address of the current page. In 
principle, this could also be performed on-line using a search engine like AltaVista 
that allows querying for pages that point to a given address. 
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4 Voting Schemes 

As discussed above, the training sets for Ripper consist of one example for 
each hyperlink. From such a set, Ripper induces a set of unordered rules³ that 
discriminate the examples of each class from the examples of all other classes. 
At prediction time, Ripper selects among all rules that fire for a given example 
the one that has the highest confidence associated with it and uses it to classify 
the example. 

Quite frequently, however, several links point to the same page. As our goal 
is to predict the class of a page (and not of each individual link), we can try 
to exploit the redundancy that is provided by such multiple links. In order to 
do so, we have to devise strategies for combining the predictions of all 
hyperlinks pointing to a page into a single prediction for the class of the page. 
We implemented the following five straightforward techniques: 

Voting (Vote(all)): The simplest technique is to give each link that points to 
a page one vote, and predict the class that receives the most votes. Ties are 
broken in favor of larger classes. Links that are classified using the default 
rule learned by RIPPER (i.e., the rule specifying that if no other rule applies, 
predict the majority class among all unexplained examples in the training 
set) are eligible to vote. 

Restricted Voting (Vote): It is reasonable to assume that there will always 
be a few links that are classified by the default rules. Thus we implemented 
another version of the voting scheme, where votes of such links are ignored 
and only links that were classified by non-default rules are eligible to vote. If 
a page only receives votes from default rules, it is classified with the majority 
class. 

Weighted Sum (Weight): We also associate a confidence score with each of 
Ripper's predictions, which simply consists of the Laplace estimate 
of the probability that an example covered by the rule is positive (estimated 
on the training set). If the prediction originates from a default rule, it is 
assigned a score of 0. Such a score is computed for each possible class of 
each link. The Weight voting scheme simply returns, for each class, the sum 
of these scores over all links as the confidence score of the prediction. 

Weighted Normalized Sum (Norm): This voting scheme is identical to the 
previous one, except that the confidence scores are first normalized in a way 
that distributes a total weight of 1 among the different candidate classes 
for each link. This is necessary because the confidence score that is associ- 
ated with each class only depends on the number of positive and negative 
examples covered by the best rule that predicts this class and covers the 
example. Therefore the confidence scores associated with each class cannot 
be interpreted as class probability estimates unless they are normalized. 

³ We have also experimented with ordered rule sets, but the results were usually a 
little worse. Besides, in “ordered” mode, Ripper treats one class as the default class 
and does not learn rules for that class. 
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Maximum Confidence (Max)-. The last combination method simply chooses 
the class prediction that receives the highest score over all links that point 
to the page to classify and predicts that class. This is an attempt to use only 
the most accurate of all applicable rules to classify a page. 
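
The five combination schemes can be summarized in a short Python sketch. It makes several assumptions that go beyond the text: each link is represented as a dictionary mapping classes to the confidence of the best rule that fired for that class (an empty dictionary means that only the default rule fired), the Laplace estimate is taken in its two-class form (p + 1) / (p + n + 2), and ties are broken in favor of larger classes for every scheme rather than for voting only. The function names are made up for the example.

```python
from collections import defaultdict

def laplace(p, n):
    """Two-class Laplace estimate for a rule covering p positive and n
    negative training examples (assumed form)."""
    return (p + 1) / (p + n + 2)

def combine_links(links, scheme, class_sizes):
    """Combine per-link predictions into one page prediction."""
    majority = max(class_sizes, key=class_sizes.get)
    totals = defaultdict(float)
    for scores in links:
        if not scores:                         # only the default rule fired
            if scheme == "vote_all":
                totals[majority] += 1          # default links may vote
            continue                           # otherwise they contribute nothing
        if scheme in ("vote_all", "vote"):
            totals[max(scores, key=scores.get)] += 1
        elif scheme == "weight":
            for cls, s in scores.items():
                totals[cls] += s
        elif scheme == "norm":
            z = sum(scores.values())           # positive for Laplace estimates
            for cls, s in scores.items():
                totals[cls] += s / z
        elif scheme == "max":
            for cls, s in scores.items():
                totals[cls] = max(totals[cls], s)
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    if not totals:                             # every link was a default prediction
        return majority
    return max(totals, key=lambda c: (totals[c], class_sizes.get(c, 0)))

sizes = {"student": 500, "course": 200, "faculty": 150}     # invented class sizes
links = [{}, {}, {}, {"faculty": 0.9, "student": 0.2}, {"faculty": 0.6}]
by_scheme = {s: combine_links(links, s, sizes)
             for s in ("vote_all", "vote", "weight", "norm", "max")}
```

In this toy case the three default predictions outvote the two informed links under Vote(all), while the other four schemes predict "faculty".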

From a Machine Learning perspective, the problem can be viewed as com- 
bining the predictions for different training examples, for which it is known that 
they have the same class label. To the extent to which the predictions of the 
classifier for different training examples are independent of each other (which 
roughly corresponds to the extent to which the feature vector representation 
of the examples differ), it can be expected that combining the predictions may 
yield a performance gain 0. 

5 Experimental Setup 

We performed a series of experiments on 1050 pages of the WebKB domain. 
These pages are classified into one of the categories Student, Department, Fac- 
ulty, Research Project, Research Associate, Post Doc, and Course. Within these 
pages, 5803 hyperlinks point to another page within this set. Each of these is 
turned into a separate training example using the set-valued features described 
above. 

The pages/links were collected from four universities. All reported results are 
from a 4-fold leave-one-university-out cross-validation, i.e., for each experiment 
we combined the examples of three universities to learn a classifier which was 
then tested on the data of the fourth university. Because of the different test 
set sizes for each of the four results, we used micro-averaging for evaluating the 
accuracy of the predictors, i.e., we lumped the predictions from all four runs 
together and computed an accuracy measure on the entire set of predictions. 
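
The difference between micro-averaging and a plain average of the four fold accuracies can be made concrete with a small sketch. The per-fold counts below are invented for illustration (they merely sum to 1050) and are not the actual sizes of the four university test sets.

```python
def micro_average(folds):
    """folds: list of (correct, total) pairs, one per held-out university."""
    return sum(c for c, _ in folds) / sum(n for _, n in folds)

def macro_average(folds):
    """Unweighted mean of the per-fold accuracies, shown for contrast."""
    return sum(c / n for c, n in folds) / len(folds)

folds = [(90, 100), (150, 300), (40, 50), (420, 600)]   # made-up counts
micro = micro_average(folds)    # 700 / 1050
macro = macro_average(folds)    # (0.9 + 0.5 + 0.8 + 0.7) / 4
```

Micro-averaging lets a large test set weigh more than a small one, so the result is the accuracy over the pooled predictions.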

More details on the experimental setup can be found in 0 , while the dataset 
is described in 0. 

6 Results 

6.1 Page Accuracy 

Table 1 shows the accuracies measured for predicting the page labels. The rows 
list the different representation schemes, starting from the default prediction 
accuracy (using no features), to the classifier that uses all features. The columns 
of the table give the accuracy for each of the 5 implemented prediction combination 
techniques, starting with the voting scheme including default prediction, voting 
without default prediction, normalized weighted average, weighted average, and 
finally the maximum method (see section 4). 

In terms of representation, it becomes apparent that using additional feature 
sets will generally result in higher accuracies. The exception to the rule is the 
Paragraph feature set. Whenever its features are added to a representation that 
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Table 1. Accuracies for classifying the 1050 pages using various methods for 
combining link predictors to page predictors. 





                                  Combination Method 
Classifier            Vote(all)    Vote    Normal   Weight     Max 
Default                 51.81     51.81    51.81    51.81    51.81 
Anchor                  67.52     74.67    74.38    74.19    74.76 
Headings                60.48     72.29    72.38    72.95    72.95 
Paragraph               63.05     66.86    66.86    66.95    66.29 
Anchor+Headings         74.48     85.33    84.95    85.14    86.57 
Anchor+Paragraph        68.00     74.29    74.00    73.90    74.67 
Headings+Paragraph      70.48     79.90    80.19    81.14    81.33 
All                     74.19     82.29    81.71    82.67    83.24 



already includes the Anchor features, the result is a loss of predictive accuracy. 
A reason for this might be that these two feature sets are much less independent 
of each other than other pairs of feature sets.⁴ The best results were achieved 
when relying only on the anchor text and the information from the headings. 

Among the five different techniques for combining the link predictions to a 
page prediction, taking the prediction with the maximum confidence is a clear 
winner. In 7 out of 8 runs, using this method gave the best results (shown in bold 
face). However, in general, the differences among the combination methods are 
not nearly as large as the differences among the different document representa- 
tions. The only exception is the voting scheme that also allowed the default rule 
to vote (first column of table 1). Apparently, the learned rules have a fairly low 
coverage, so that many of the links have to be classified using the default rule. It 
happens quite frequently that a few good rules are outnumbered by a number of 
default predictions. We have also found that the performance deteriorates simi- 
larly when default predictions are included into the weighted prediction combiner 
(results not shown) . The maximum technique remains mostly unaffected by this 
because it is unlikely that a default prediction receives the maximum confidence 
score among a number of competing link predictions (results also not shown) . 

6.2 Link Accuracy 

One question that remains unanswered by table 1 is how much has actually 
been gained by combining the prediction of different links pointing to a single 
page. To investigate this question, we computed a weighted accuracy estimate 
by weighting each page with the number of links that point to that page. In 

⁴ Note, however, that even though the set of words occurring in the anchor text 
is a subset of the set of words occurring in its surrounding paragraph, the 
resulting Anchor feature set is not a subset of the resulting Paragraph feature set 
because the feature x_occurs_in_anchor_text is semantically different from the 
feature x_occurs_in_paragraph, the former being more specific than the latter. 
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Table 2. Accuracies for classifying the 5803 links with various predictor combi- 
nation methods. 





                                       Combination Method 
Encoding                 No     Vote(all)    Vote    Normal   Weight     Max 
Default                36.67     36.67      36.67    36.67    36.67    36.67 
Anchor                 57.92     58.80      75.93    75.56    75.37    76.05 
Headings               43.34     40.01      66.62    69.89    70.77    64.33 
Paragraph              53.40     55.09      65.91    65.81    66.33    58.59 
Anchor+Headings        62.49     61.66      86.18    85.46    86.25    83.22 
Anchor+Paragraph       58.40     59.23      73.70    73.67    73.46    71.81 
Headings+Paragraph     58.50     56.69      78.67    78.98    80.30    76.63 
All                    57.99     61.43      79.15    77.74    79.44    79.20 



other words, all links that point to the same page perform an internal vote to 
decide upon a common classification for the page they point to. Each link of 
such a group is then classified with this common label. The resulting accuracy 
estimate counts the number of correctly predicted links over all links, and can 
thus be directly compared to the accuracies of the base classifiers that predict 
the class labels of each link independently. 
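
The link-weighted accuracy described above can be sketched as follows. The three toy pages and the function names are invented for the example. Each page is given as the number of links pointing to it, its true class, and the class predicted by the combined vote.

```python
def page_accuracy(pages):
    """pages: list of (n_links, true_class, predicted_class)."""
    return sum(t == p for _, t, p in pages) / len(pages)

def link_accuracy(pages):
    """Every link inherits the label predicted for the page it points to,
    so each page is weighted by its number of incoming links."""
    correct = sum(n for n, t, p in pages if t == p)
    return correct / sum(n for n, _, _ in pages)

pages = [(10, "department", "department"),   # heavily linked page, correct
         (1, "student", "faculty"),          # single link, wrong
         (1, "student", "student")]          # single link, correct
page_acc = page_accuracy(pages)   # 2 of 3 pages
link_acc = link_accuracy(pages)   # 11 of 12 links
```

A correct prediction for a heavily linked page dominates the link-based figure, which is why the two measures can rank the combination methods differently.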

These results are shown in table 2. The first thing to note is a substantial 
difference between the independent classifier (first column) and the classifiers 
that rely on combining the predictions for different links for all methods except 
voting with inclusion of default predictions. Obviously, many mistakes could be 
corrected by combining the predictions of different links and thus being able to 
rely on good features that appeared in a different link pointing to the same page. 

Secondly, the differences between the voting scheme that includes default 
predictions (second column) and the voting schemes that ignore them are more 
remarkable than in table 1. We explain this with the fact that for pages with 
many incoming links, there are good chances that many of the links are classified 
by default rules, and that the combination of these predictions overrides the few 
“educated” guesses. With the voting schemes that ignore default predictions, 
the situation is the opposite: a few correct rule-based classifications can override 
many wrong default classifications and thus gain substantially in accuracy. 

It is also interesting to observe that in table 2 the Max prediction method 
(last column) is not as dominant as in table 1 and, in some cases, it performs 
substantially worse than its competitors. The reason for this is that the maximum 
prediction is much less susceptible to variations in the number of link predictions 
that are combined to a single page prediction. If an erroneous link prediction has 
the maximum confidence score, it is used for predicting the class of the page. The 
voting and weighting methods, on the other hand, can make use of a number 
of unanimous predictions with lower confidence scores to override a prediction 
with a higher confidence score. Thus, it can be expected that pages with a higher 
number of incoming links are classified more reliably by voting or weighting. 
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Table 3. Recall and Precision for the page predictors. Recall is the percentage 
of pages that are not classified with the default rule and Precision is how many 
of these classifications were correct. 



                                          Precision 
Classifier            Recall     Vote    Normal   Weight     Max 
Anchor                 40.76    83.64    82.30    82.48    83.88 
Headings               74.10    88.05    88.19    88.95    88.95 
Paragraph              46.67    75.71    75.91    75.92    74.49 
Anchor+Headings        85.90    92.35    91.91    92.13    93.79 
Anchor+Paragraph       60.19    83.23    82.81    82.59    83.86 
Headings+Paragraph     82.38    88.44    88.79    89.94    90.17 
All                    78.00    86.94    86.20    87.42    88.16 



while pages with a lower number of incoming links are better classified by taking 
the prediction with the maximum score. As the latter category is more frequent, 
the page accuracies tend to be higher for the maximum prediction method, while 
the link accuracies tend to be higher for the voting and weighting schemes. 



6.3 Recall and Precision 

We have discussed above that many of the test examples are classified using 
the default rule and that it seems to be advisable to ignore these default link 
predictions for computing the page predictions. But what happens in cases where 
all links that point to a page are classified by default rules, i.e., no link contains 
any information that could be used for a justified prediction? In the experiments 
reported in the previous section, we have simply predicted the majority class 
Student for each of these pages. What if we ignore these predictions? It can be 
expected that the classification accuracy goes up at the expense of classifying 
fewer pages. This trade-off is commonly measured in terms of precision and 
recall. 

Table 3 lists recall and precision estimates that shed some light upon this 
question. Recall is the percentage of pages which were classified using at least 
one rule different from the default rule. Note that this estimate is the same for 
all combination methods because the underlying link classifiers are the same and 
hence the links that are classified with default rules are the same. Precision is 
the percentage of classified pages that were correctly classified. In general, the 
precision scores are much higher than the accuracy results of table 1. This is 
not surprising because accuracy can be viewed as a weighted sum between the 
precision on the recalled examples and the precision on the examples classified by 
default rules (which should be about the default accuracy, although the variance 
can be very high). The recall scores are more differentiated. Among the single 
feature sets, Headings features not only have the highest precision, but also 
achieve this precision at significantly higher recall scores. 
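
The relation between accuracy, recall and precision stated above can be checked with a small sketch. The representation of a page as a (true class, predicted class) pair, with None marking pages whose links were all classified by default rules, and the five toy pages are assumptions made for the example.

```python
def recall_precision(pages, majority_class):
    """pages: list of (true_class, predicted_class); predicted_class is None
    when every incoming link was classified by the default rule."""
    recalled = [(t, p) for t, p in pages if p is not None]
    defaulted = [t for t, p in pages if p is None]
    recall = len(recalled) / len(pages)
    precision = (sum(t == p for t, p in recalled) / len(recalled)) if recalled else 0.0
    # Pages without a justified prediction fall back to the majority class.
    default_precision = (sum(t == majority_class for t in defaulted) / len(defaulted)
                         if defaulted else 0.0)
    # Accuracy is the recall-weighted sum of the two precisions.
    accuracy = recall * precision + (1 - recall) * default_precision
    return recall, precision, default_precision, accuracy

pages = [("student", "student"), ("faculty", "faculty"), ("faculty", "student"),
         ("student", None), ("course", None)]
recall, precision, default_precision, accuracy = recall_precision(pages, "student")
```

Here three of the five pages receive a justified prediction (recall 0.6), two of those are correct (precision 2/3), and the overall accuracy of 0.6 equals 0.6 * 2/3 + 0.4 * 0.5.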
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Table 4. Accuracy and number of 
features for using feature subset se- 
lection on the full-text classifier. 



Classifier     # Features    Accuracy 
Link-Based        8,075       85.05% 
Full-Text        20,322       70.67% 



Table 5. Accuracy results for feature subset selection on the full-text 
classifier. 

Features    Accuracy 
100%         70.67 
50%          73.90 
10%          74.19 
5%           74.76 
1%           71.33 
0.1%         54.67 



6.4 Comparison to Full-Text Classifier 

We also compared the predictive accuracies of the link-based page classifiers to 
those of a classifier that uses the words occurring on the page as a feature. 
Table 4 shows a comparison between the link-based classifier using all four feature 
sets and a full-text classifier in terms of predictive accuracy and the number 
of features used by both representations.⁵ The link-based classifiers discussed 
in this paper are considerably more accurate while using less than half of the 
number of features. 

An obvious question at this point is, of course, whether the differences in 
accuracy are at least partly due to the different number of features. Could we 
produce a similar effect by employing feature subset selection on the full-text 
classifier? We have already seen that, in general, adding additional feature sets 
improves accuracy. On the other hand, it is also known that too many features 
can lead to overfitting. Table 5 shows the results for using only the top n% of the 
features of the full-text classifier (selected by entropy). Feature subset selection 
results in some improvement, but the best result is still more than 10 percentage 
points behind the link-based classifiers. We have not checked whether feature 
subset selection would improve the link-based classifiers as well. 
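
Selecting the top n% of the features "by entropy" can be sketched as below. The text does not spell out the criterion, so the sketch assumes information gain (the reduction in class entropy obtained by splitting on the presence of a word). The function names and the four toy documents are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, word):
    """Entropy reduction from splitting the documents on presence of `word`."""
    present = [y for d, y in zip(docs, labels) if word in d]
    absent = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (present, absent) if part)
    return entropy(labels) - remainder

def top_features(docs, labels, fraction):
    """Keep the given fraction of the vocabulary, ranked by information gain."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(vocabulary, key=lambda w: information_gain(docs, labels, w),
                    reverse=True)
    return ranked[:max(1, round(fraction * len(ranked)))]

docs = [{"advisor", "home"}, {"advisor", "page"}, {"course", "home"}, {"course", "page"}]
labels = ["faculty", "faculty", "course", "course"]
selected = top_features(docs, labels, 0.5)
```

In the toy data, "advisor" and "course" separate the two classes perfectly (gain of 1 bit), while "home" and "page" carry no information, so a 50% selection keeps only the first two words.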



7 Conclusion 

Our results show that it is possible to classify documents more reliably with in- 
formation originating from pages that point to the document than with features 
that are derived from the document text itself. Furthermore, it proved to be 
beneficial to be able to exploit redundant information on the WWW by combin- 
ing multiple predictions (one for each hyperlink pointing to a page). However, 
we have shown this for one domain only, so our results can only be considered 
as preliminary. More experimental work in other domains must be conducted in 
order to establish a conclusive result. 

* The reported number of features is the total number of features in the entire dataset.
Each of the four training sets of the cross-validation contained on average a little 
more than 80% of the features for both types of classifiers. 
Although the encoding scheme we used is quite straightforward, it illustrates
that the use of information about the HTML structure of pages and about the
structure of the WWW itself can be useful for improving text categorization
on the WWW. The use of more elaborate representation schemes (e.g.,
distinguishing different types of headings or even using an entire HTML tree [?]
as background knowledge) suggests itself as a rewarding topic for further
research, as does the use of relational learning techniques (see, e.g., [?]).
We have already performed preliminary experiments using linguistic phrases of
the kind used in [?] as an additional feature set, but found that they did not
make much difference [?]. Such approaches also need to be investigated in more
detail.
Abstract. The Web is full of information sources. Currently, retrieving
useful information on the Web is a time-consuming process. In this paper,
we propose a multi-agent learning approach to information retrieval on
the Web, in which each agent collaboratively learns its environment from
the user’s relevance feedback using a neural network mechanism. Our
approach makes it possible to discover information sources associated with
useful information and then retrieve that information effectively. First,
we present the framework of an IR agent and its operation in our multi-agent
learning approach. Second, we define the multi-agent IR system based
on our approach and describe its training procedure for collaborative
information retrieval. Finally, we present the experimental results
of our approach and compare them to those obtained by a traditional
meta-search service.



1 Introduction 

As the number and diversity of distributed information sources on the Web
increase rapidly, there has been a growing demand for Web Information
Retrieval (IR) systems that help people search for useful information. Thus a
number of automated Web IR systems have been developed. Conventional Web IR
systems usually build database indices for all information on a single platform
and use a well-known retrieval model such as the vector space model based on
the TFIDF algorithm [?]. These approaches, however, often cause massive
bottlenecks and unacceptable access delays under highly competitive access
situations, as in search tools such as Lycos [?] and WebCrawler [?]. This
shortcoming becomes more severe as the amount of information stored in the form
of database indices increases. This problem has led to meta-search services such
as IBM InfoMarket^1 and MetaCrawler^2 that distribute the IR task among several
search tools. The traditional meta-search service broadcasts a user query to
several search tools simultaneously, merges the results submitted by these
search tools, and presents them to the user as an HTML page with clickable
URLs [?]. Therefore, the traditional meta-search service provides its user with
a layer of abstraction over multiple search tools and serves as a Web search
interface.

^1 The URL of IBM InfoMarket is www.infomarket.ibm.com.

^2 The URL of MetaCrawler is www.metacrawler.com/index_text.html.

D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 499-[?], 1999.
© Springer-Verlag Berlin Heidelberg 1999

The traditional meta-search service broadcasts a user query to all accessible
search tools even when the index of useful information may be located at only
one or a few of the search tools, which results in ineffective and wasteful
competition for network resources and considerable communication cost.
Therefore, the indiscriminate broadcast of the traditional meta-search service
is inefficient in the Web environment, where the indices of related information
are often clustered according to the categories of the search tools, so that the
useful information will be given by only one or a few search tools. These
considerations for Web IR raise an interesting problem, which we call the
multiple access problem of IR: if there exist many information sources, such as
search tools, that have or might have the information sought, how many queries
should be issued and to which sources should the queries be given?

To solve this problem, we propose a multi-agent cooperative learning approach
for Web IR. In the proposed approach, a number of information agents interact,
effectively utilizing the distributed search tools, and learn their environment
from the user’s feedback so that they can efficiently retrieve useful
information for the user from the widely spread information sources on the Web.
We adopted the BackPropagation neural Network (BPN) [?] as the learning and
generalization mechanism that each information agent uses to build up and employ
knowledge about the information resources relevant to the user’s interests or
preferences. The knowledge is acquired by means of the user’s relevance feedback
judgement. For any query, the multi-agent IR system constructed with our
approach locates the search tools that could give information useful to the
user and then retrieves that information, provided that it has been sufficiently
trained by the BPN learning mechanism.

The remainder of this paper is organized as follows. In Section 2, we define an
IR agent for our multi-agent learning approach and describe its operation
procedures. In Section 3, we define the multi-agent IR system constructed from
a number of IR agents and search tools, and describe its training procedure
for effective information retrieval. In Section 4, we evaluate our approach by
measuring the performance of an experimental system and comparing it to that
obtained by the traditional meta-search service approach.

2 IR Agent 

The term agent system (or agent) has been increasingly used within information
technology to describe various computational entities. In particular, the
exceedingly voluminous and readily available information on the Web has given
rise to the development of agents as computational entities for accessing,
discovering and retrieving information. Such an agent for IR (IR agent) can be
described as a self-contained problem-solving entity usually (but not
necessarily) equipped with internal knowledge, sensors, effectors and data
pertinent to the IR problem; it also has its own built-in control mechanism [?].
The following two subsections define an IR agent and then describe its operation
for information retrieval and training.

Fig. 1. Framework of an IR Agent



2.1 Definition 

To solve the multiple access problem of IR, our approach takes, as its problem
domain, an environment in which there exist a number of Web search tools. A
Web search tool is an information source that receives a query and returns
some information relevant to that query based on its own IR database indices.
In this environment, each IR agent locates the search tools associated with
useful information and then retrieves that information from those search tools,
communicating with its neighboring IR agents if necessary. Each IR agent sends a
given query to, and then receives the information relevant to that query from,
a directly accessible search tool or neighboring IR agent, which we call a
cooperator of that IR agent. Therefore, each IR agent can retrieve useful
information for a given query from its cooperators by sending that query to
them. Fig. 1 shows the main components of an IR agent and the control flows
among them. Thus, an IR agent $\alpha$ is defined by the 6-tuple
$\alpha = \langle QB, IM, RF, TG, LM, QS \rangle$, where each component is
described in the following paragraphs.



Query Broadcaster (QB) The Query Broadcaster broadcasts a given query to
all cooperators of its IR agent in order to receive all information relevant to
that query from them.



Information Merger (IM) The Information Merger merges the information
submitted by the cooperators of its IR agent and then presents it to the issuer
of the query.

Fig. 2. Training and recall phase of BPN



Relevance Feedback (RF) The Relevance Feedback receives the user’s judgement
of the information presented by the IM for a given query $q$ and then
generates a binary vector representation $c_q = (c_{q1}, c_{q2}, \ldots, c_{qM})$
where, if $S = \langle a_1, a_2, \ldots, a_M \rangle$ is the ordered set of all
cooperators of the IR agent that the RF belongs to, then, for
$j = 1, 2, \ldots, M$,

$$c_{qj} = \begin{cases}
1 & \text{if the information submitted by } a_j \text{ for the query } q \text{ is judged to be relevant by the user,} \\
0 & \text{otherwise.}
\end{cases}$$



Term Vector Generator (TG) The Term Vector Generator transforms a
query $q$, which is expressed as a set of index terms by eliminating non-content
words and stemming each plural noun to its singular form and each inflected verb
to its base form [?], into a binary vector representation (called the term
vector) $s_q = (s_{q1}, s_{q2}, \ldots, s_{qN})$ where, if
$T = \langle t_1, t_2, \ldots, t_N \rangle$ is the ordered set of all index terms
and $q \subseteq T$, then, for $i = 1, 2, \ldots, N$,

$$s_{qi} = \begin{cases}
1 & \text{if } t_i \in q, \\
0 & \text{otherwise.}
\end{cases}$$
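The mapping from a query to its binary term vector can be sketched as follows. This is only a minimal illustration: the stop-word list and the index terms are hypothetical, and stemming is left out entirely, so it covers only the non-content word removal and the indicator definition of $s_q$.

```python
# Hypothetical list of non-content words; the paper does not give one.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "for", "and", "to"}


def term_vector(query, index_terms):
    """Return the binary vector s_q over the ordered index terms T."""
    words = {w for w in query.lower().split() if w not in STOP_WORDS}
    # s_qi is 1 exactly when the i-th index term occurs in the query.
    return [1 if t in words else 0 for t in index_terms]


T = ["agent", "network", "neural", "retrieval", "web"]  # hypothetical terms
s_q = term_vector("Neural network for the Web", T)
```

Because T is ordered, the position of each component identifies its index term, which is what allows the vector to be applied directly to the input layer of the network.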



Learning Mechanism (LM) To learn from user relevance feedback and to recall
when retrieving information, each agent has a Learning Mechanism in the form
of a neural network associative memory, shown as the shaded rectangle
in Fig. 1. A backpropagation neural network (BPN) is used for this neural
network associative memory to take advantage of its learning and generalization
properties. Fig. 2 shows that the BPN of each IR agent acts in two phases: a
training phase and a recall phase.

During the training phase, the input and the output layer of the BPN are set
to represent a training pair $(s_q, c_q)$, where $s_q$ is produced by the TG and
$c_q$ is produced by the RF for a given query. The well-known BPN learning
procedure is performed for all training pairs made of the outputs produced by
the TG and the RF for the given training queries. Fig. 3 shows the overall BPN
learning procedure for a set of training pairs T. The BPN learning procedure
adjusts the link-weight matrices of the BPN using the backpropagation learning
rule [?], and its result is stored as the IR knowledge about the user’s
interests or preferences.

Multi-agent Web Information Retrieval: Neural Network Based Approach

Procedure Learning(BPN, T)    // This procedure trains BPN with T
  BPN : a backpropagation neural network
  T : a training query set
begin
  repeat
    for each training pair (s_q, c_q) ∈ T
    begin
      Apply s_q to the input layer of BPN,
      Calculate the output of BPN,
      Calculate the error between the output of BPN and the desired output c_q,
      Adjust the link-weight matrices of BPN in a way that minimizes the error,
    end
  until the error is acceptably small for all (s_q, c_q) ∈ T,
  Store the link-weight matrices of BPN,
end

Fig. 3. BPN learning procedure

During the recall phase, the input layer of the BPN is activated by applying
the term vector produced by the TG for a newly given query. This activation of
the BPN spreads from the input layer to the output layer using the link-weight
matrices stored during the training phase. This spreading activation^3 produces,
as the output of the BPN, a vector representation whose components are all
between 0 and 1.
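The two phases can be sketched with a small one-hidden-layer network. This is a minimal sketch under stated assumptions, not the configuration used in the experiments: the sigmoid units, the learning rate, the error tolerance, the weight range and the toy training pairs are all choices made here for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class BPN:
    """One-hidden-layer sigmoid network trained by backpropagation."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Each row holds the weights of one unit; the last entry is its bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_output = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                         for _ in range(n_out)]

    def _forward(self, s):
        x = list(s) + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(row, x)))
             for row in self.w_hidden]
        hb = h + [1.0]
        o = [sigmoid(sum(w * v for w, v in zip(row, hb)))
             for row in self.w_output]
        return x, hb, o

    def recall(self, s):
        """Recall phase: spread activation from the input to the output layer."""
        return self._forward(s)[2]

    def train_pair(self, s, c, rate):
        """One backpropagation step; returns the largest output error seen."""
        x, hb, o = self._forward(s)
        d_out = [(ck - ok) * ok * (1.0 - ok) for ck, ok in zip(c, o)]
        d_hid = [hb[j] * (1.0 - hb[j]) *
                 sum(d_out[k] * self.w_output[k][j] for k in range(len(o)))
                 for j in range(len(hb) - 1)]
        for k, row in enumerate(self.w_output):
            for j in range(len(row)):
                row[j] += rate * d_out[k] * hb[j]
        for j, row in enumerate(self.w_hidden):
            for i in range(len(row)):
                row[i] += rate * d_hid[j] * x[i]
        return max(abs(ck - ok) for ck, ok in zip(c, o))


def learning(bpn, pairs, rate=0.5, tolerance=0.1, max_cycles=10000):
    """Repeat over all training pairs until every error is below tolerance."""
    for cycle in range(1, max_cycles + 1):
        worst = max(bpn.train_pair(s, c, rate) for s, c in pairs)
        if worst < tolerance:
            return cycle
    return max_cycles


# Toy training pairs (s_q, c_q): four one-term queries, two cooperators.
pairs = [([1, 0, 0, 0], [1, 0]), ([0, 1, 0, 0], [0, 1]),
         ([0, 0, 1, 0], [1, 1]), ([0, 0, 0, 1], [0, 0])]
bpn = BPN(n_in=4, n_hidden=4, n_out=2)
cycles = learning(bpn, pairs)
```

After training, `bpn.recall(s)` returns a vector with components strictly between 0 and 1, one per cooperator, which is the quantity the Query Sender later compares against its tolerance constant.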

The learning mechanism using the BPN is the core of an IR agent, and thus the
parameter configuration of the BPN may have a crucial effect on the performance
of our IR approach. We describe the configuration of the BPN for the
experimental IR system in Section 4.



Query Sender (QS) The Query Sender sends a given query selectively to the
cooperators of its IR agent according to the output of the BPN recall phase
as follows: Let $S = \langle a_1, a_2, \ldots, a_M \rangle$ be the ordered set of
all cooperators of the IR agent that the QS belongs to, let
$o_q = (o_{q1}, o_{q2}, \ldots, o_{qM})$ be the output vector of the BPN recall
phase for a given query $q$, and let $\tau$ be a tolerance constant^4 such that
$0 < \tau < 1$. Then, for $i = 1, 2, \ldots, M$, the QS sends $q$ to $a_i$ if
and only if $o_{qi} > \tau$.

^3 This spreading activation procedure is described in more detail in [?].

^4 In the experiment system, we have used 0.75 as the tolerance constant.
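The selection rule of the Query Sender amounts to a strict threshold on the recall output. A minimal sketch, in which the function name and the cooperator labels are hypothetical and only the default of 0.75 comes from the text:

```python
def select_cooperators(cooperators, outputs, tau=0.75):
    """Return the cooperators a_i whose BPN output o_qi exceeds tau."""
    if not 0 < tau < 1:
        raise ValueError("tau must lie strictly between 0 and 1")
    return [a for a, o in zip(cooperators, outputs) if o > tau]


selected = select_cooperators(["s1", "a2", "a3"], [0.10, 0.92, 0.75])
```

Since the comparison is strict, a cooperator whose output equals the tolerance constant exactly is not queried.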
2.2 Operation

In this section, we explain how an IR agent is trained to retrieve useful
information for a given query. First, we describe the procedure by which an IR
agent retrieves information for a given query, and then we describe the training
procedure of an IR agent using the user’s relevance feedback. In the following,
we describe the procedures using the key components of an IR agent explained in
the previous section.



Information Retrieval An IR agent $\alpha = \langle QB, IM, RF, TG, LM, QS \rangle$
receives a query expressed as a character string from the human user or some
other IR agent, and then returns the information for that query in the
following steps.

Step 1: TG transforms a given query $q$ into its term vector $s_q$.

Step 2: LM (BPN), activated by $s_q$, produces an output vector $o_q$ in its
recall phase.

Step 3: QS selects cooperators based on $o_q$ and sends $q$ to the selected
cooperators.

Step 4: IM merges all information submitted by the cooperators selected in
Step 3, and presents it to the issuer of $q$.

In Steps 2 and 3, an IR agent locates the cooperators that are expected to give
useful information for a given query, using the IR knowledge stored as the
link-weight matrices of the BPN, and sends the query to those cooperators to
retrieve information.
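The four steps can be sketched as a single method. In this illustration the class name, the modelling of cooperators as callables, and the stub that stands in for a trained BPN are all assumptions; stop-word removal and stemming are omitted.

```python
class IRAgent:
    """Sketch of the retrieval steps; `recall` stands in for a trained BPN."""

    def __init__(self, index_terms, cooperators, recall, tau=0.75):
        self.index_terms = index_terms  # ordered index terms T
        self.cooperators = cooperators  # callables: query -> list of results
        self.recall = recall            # callable: term vector -> output vector
        self.tau = tau                  # tolerance constant

    def retrieve(self, query):
        words = set(query.lower().split())
        # Step 1 (TG): build the binary term vector s_q.
        s_q = [1 if t in words else 0 for t in self.index_terms]
        # Step 2 (LM): the recall phase produces the output vector o_q.
        o_q = self.recall(s_q)
        # Step 3 (QS): select the cooperators whose output exceeds tau.
        chosen = [c for c, o in zip(self.cooperators, o_q) if o > self.tau]
        # Step 4 (IM): query the chosen cooperators and merge their answers.
        merged = []
        for cooperator in chosen:
            merged.extend(cooperator(query))
        return merged


def medicine(query):
    return ["medicine:" + query]


def physics(query):
    return ["physics:" + query]


def stub_recall(s_q):
    # Hypothetical stand-in: route by the first index term.
    return [0.9, 0.1] if s_q[0] else [0.1, 0.9]


agent = IRAgent(["vaccine", "quantum"], [medicine, physics], stub_recall)
hits = agent.retrieve("vaccine")
```

Because `agent.retrieve` has the same shape as a search tool (a query in, a list of results out), an agent can itself be passed as a cooperator of another agent.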



Training The training procedure of an IR agent acquires the IR knowledge
about the user’s interests or preferences using the user’s relevance feedback.
When a query is given by the user, the IR agent broadcasts that query to all its
cooperators through the QB, and then waits for the information from them. Once
all information has been returned, the IM merges that information and displays
it to the user. The RF asks the user to mark the pieces of information judged
relevant to his/her query and then determines from which cooperator each marked
piece of information was returned. This produces the binary vector
representation, which indicates which of the cooperators of the IR agent gave
information relevant to the user’s interests or preferences for the given query,
as described in Section 2.1. The term vector and the binary vector
representation, which are produced for a given query by the TG and the RF
respectively, compose a training pair. The BPN of an IR agent is trained with
the set of training pairs obtained from all the queries given by the user.
Fig. 4 shows the entire training procedure of an IR agent $\alpha$ for the
training queries given by the user. With no training, the BPN that serves as the
learning mechanism of an IR agent has randomly initialized link-weight matrices,
so that for any query its recall phase does not produce any heuristic knowledge
about the location of useful information. But as the BPN is trained with more
training pairs, it produces more heuristic knowledge. Eventually, after being
trained with sufficient training pairs, it can produce heuristic knowledge
about the location of useful information for any query because, by the White
theorem [?], the probability of the BPN error exceeding any tolerance level goes
to zero as the size of the training set increases. This potential is examined
experimentally in Section 4.

Procedure ATrains(α):
  α : an IR agent
  Let α = ⟨QB, IM, RF, TG, LM, QS⟩    // LM is BPN
begin
  trainingpairset ← ∅,
  for each training query q given by user
  begin
    TG generates the term vector s_q,
    QB broadcasts q to all cooperators of α,
    Wait for all information returned from all cooperators of α,
    IM displays all information from all cooperators of α to the user,
    RF produces the binary vector representation c_q from the user’s relevance
      feedback,
    trainingpairset ← trainingpairset ∪ {(s_q, c_q)},
  end
  Train LM by calling Learning(LM, trainingpairset),
end

Fig. 4. Training procedure of an IR agent

3 Multi-agent IR System 

In the previous section, we defined an IR agent and then described its operation 
procedures. In this section, we define a multi-agent IR system based on IR agents 
and their accessible search tools, and then describe the training procedure of 
multi-agent IR system for collaborative information retrieval. 



3.1 Definition 

A multi-agent IR system is constructed from a number of search tools, which
serve as information sources, and IR agents that retrieve information from
those search tools. It may be defined as follows:

Definition: A multi-agent IR system is a 3-tuple $M = \langle A, S, R \rangle$
where $A$ is a set of IR agents, $S$ is a set of search tools and $R$ is a
binary relation on $A \times (A \cup S)$ such that $\langle x, y \rangle \in R$
if and only if $y$ is a cooperator of $x$.

As can be seen, a multi-agent IR system $M = \langle A, S, R \rangle$ can be
represented as a directed graph diagram where the elements of $A$ are IR agent
nodes, the elements of $S$ are search tool nodes and each
$\langle x, y \rangle \in R$ is an arrow (directed link) from $x$ to $y$. For
example, let $M = \langle A, S, R \rangle$, where $A = \{a_1, a_2, a_3\}$,
$S = \{s_1, s_2, s_3, s_4, s_5, s_6, s_7\}$ and
$R = \{\langle a_1, s_1 \rangle, \langle a_1, a_2 \rangle, \langle a_1, a_3 \rangle,
\langle a_2, s_2 \rangle, \langle a_2, s_3 \rangle, \langle a_2, s_4 \rangle,
\langle a_3, s_5 \rangle, \langle a_3, s_6 \rangle, \langle a_3, s_7 \rangle\}$.
Then, $M$ is represented as the graph diagram of Fig. 5, where every arrow
leaving a node that represents an IR agent points to a node that represents a
cooperator of that IR agent.

Fig. 5. Graph diagram of a multi-agent system (legend: O : Agent; the second
node symbol : Search tool)

3.2 Operation 

In this section, we explain how a multi-agent IR system is trained to
collaboratively retrieve useful information. First, we describe a training
procedure based on the definition of the multi-agent IR system, and then we
explain the collaborative information retrieval of the multi-agent IR system.



Training Using the training procedure of each IR agent and the definition of
the multi-agent IR system, we describe the training procedure of a multi-agent
IR system. For an IR agent to be trained, every one of its cooperators that is
not a search tool must already have been trained. Otherwise, the information
submitted by some cooperators of that IR agent can be inaccurate, and the IR
agent would then be trained with inaccurate information. Based on this
restriction, Fig. 6 shows the training procedure of a multi-agent IR system $M$.
For example, for the multi-agent IR system represented as the graph diagram of
Fig. 5, the IR agents are trained in the order $a_2, a_3, a_1$ or
$a_3, a_2, a_1$ by the procedure of Fig. 6. In this procedure, we assume that
the directed graph diagram representing a multi-agent IR system is acyclic. A
training method for multi-agent IR systems represented by a cyclic directed
graph diagram is also being developed, but we leave it to future work.
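The ordering restriction can be sketched as a round-based schedule over the cooperation relation. This is a variant written for illustration rather than a transcription of the procedure in the figure: agents become eligible in whole rounds, and the cycle check, the function name and the string labels are assumptions.

```python
def training_order(agents, search_tools, relation):
    """Order agents so that every agent cooperator is trained first."""
    remaining = list(agents)
    order = []
    while remaining:
        # An agent is ready when each cooperator is a search tool or trained.
        ready = [a for a in remaining
                 if all(c in search_tools or c not in remaining
                        for (x, c) in relation if x == a)]
        if not ready:
            raise ValueError("cooperation graph contains a cycle")
        for a in ready:
            order.append(a)
            remaining.remove(a)
    return order


# The example system M = <A, S, R> from the definition above.
A = ["a1", "a2", "a3"]
S = {"s1", "s2", "s3", "s4", "s5", "s6", "s7"}
R = [("a1", "s1"), ("a1", "a2"), ("a1", "a3"),
     ("a2", "s2"), ("a2", "s3"), ("a2", "s4"),
     ("a3", "s5"), ("a3", "s6"), ("a3", "s7")]
order = training_order(A, S, R)
```

For the example system this yields a2, a3, a1, one of the two orders named in the text; on a cyclic relation the sketch raises an error instead of looping forever.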



Procedure MTrains(M):
  M : a multi-agent IR system
  Let M = ⟨A, S, R⟩
begin
  agentset ← A,
  while agentset ≠ ∅ do
  begin
    untrained ← agentset,
    for each a ∈ untrained
      if c ∈ S or c ∉ agentset for all c such that ⟨a, c⟩ ∈ R then
      begin
        Train a by calling ATrains(a),
        agentset ← agentset - {a},
      end
  end
end

Fig. 6. Training procedure of multi-agent IR system

Collaborative Information Retrieval Each IR agent in a multi-agent IR
system sends a given query to some of its cooperators according to its trained
BPN and then presents the information submitted by them to the issuer of
that query, following the information retrieval steps of an IR agent described
in Section 2.2. For example, let us assume that the multi-agent IR system
represented as the graph diagram of Fig. 5 was trained with sufficient training
queries by the training procedure of Fig. 6. If a human user gives a query $q$
whose relevant information is indexed in the search tool $s_6$ to the IR agent
$a_1$, then $a_1$ sends $q$ to $a_3$, as determined by $a_1$’s BPN, and $a_3$
in turn sends $q$ to $s_6$, as determined by $a_3$’s BPN. Then, if $s_6$
submits the information relevant to $q$ to $a_3$, $a_3$ presents it to $a_1$,
and $a_1$ presents it to the human user. As a result, through their learning
mechanisms, the IR agents in a multi-agent IR system can collaboratively
retrieve useful information without exhaustively traversing all search tools in
that system. This may result in a significant improvement in terms of
communication cost and also make it possible to filter out information
irrelevant to the user’s interests or preferences, which will be examined
experimentally in the following section.
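The routing in this example can be traced with a short recursive sketch that also counts query passes. The routing table below stands in for the decisions of the trained BPNs, and the index contents and the way passes are counted (one per agent-to-cooperator hop, not counting the user's initial request) are assumptions made for illustration.

```python
def route(node, query, choose, search_tools, index):
    """Send `query` to `node`; return (results, number of query passes)."""
    if node in search_tools:
        # A search tool answers from its own index and forwards nothing.
        return list(index.get(node, {}).get(query, [])), 0
    results, passes = [], 0
    for cooperator in choose(node, query):
        passes += 1  # one query pass from this agent to the cooperator
        found, below = route(cooperator, query, choose, search_tools, index)
        results.extend(found)
        passes += below
    return results, passes


SEARCH_TOOLS = {"s1", "s2", "s3", "s4", "s5", "s6", "s7"}
INDEX = {"s6": {"q": ["doc-17"]}}        # hypothetical index contents
TRAINED = {"a1": ["a3"], "a3": ["s6"]}   # stand-in for the BPN decisions


def choose(agent, query):
    return TRAINED.get(agent, [])


results, cost = route("a1", "q", choose, SEARCH_TOOLS, INDEX)
```

Under this counting the query reaches the one relevant search tool in two passes, whereas a direct broadcast to all seven search tools would take seven.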



4 Experiment 

We evaluated the performance of our multi-agent learning approach to IR on the
popular search directories of Yahoo! Korea^5. Yahoo! Korea provides
hierarchically organized directories in the Korean language according to various
categories, which are both browsable and searchable. We chose seven of these
directories, each of which serves as a search tool that retrieves the
descriptions of documents relevant to a given query for its category.

^5 The URL of Yahoo! Korea is http://www.yahoo.co.kr.

On these seven directories of Yahoo! Korea, we constructed a multi-agent IR
system composed of three IR agents, each of which has three cooperators, as
shown in Fig. 5. In this figure, $s_1$, $s_2$, $s_3$, $s_4$, $s_5$, $s_6$ and
$s_7$ represent the chosen seven directories of Yahoo! Korea for the Medicine,
Computer Science, Electrical Engineering, Mechanics, Physics, Chemistry and
Biology categories, respectively. For this multi-agent IR system, we used the
following configuration for the BPN of each IR agent.

• The number of input and output units is 200 and 3, respectively: the number
of input units was determined empirically as a number large enough to
represent all training queries as different binary vectors. The number of output
units equals the number of cooperators of each IR agent.

• Only one hidden layer has been used because, by the Kolmogorov theorem [?],
three layers are theoretically sufficient to operate as an approximate
associative memory.

• The number of hidden units was set to 125: this number was also determined
empirically, as small a number as possible that still allows each IR agent to
be trained successfully.



We extracted 200 representative keywords from the short descriptions of documents
provided by the chosen seven directories of Yahoo! Korea and used these
keywords as training queries. We gave all 200 training queries to each IR agent
of our multi-agent IR system and then trained the system by the procedure of
Fig. 6. For comparison, we implemented another IR system, based on the
indiscriminate broadcast of the traditional meta-search service, on the same
directories of Yahoo! Korea as those used in our approach. For the performance
test, 50 test queries were used. The test queries were generated as follows.
First, we randomly selected 50 documents from the document descriptions provided
by the chosen seven directories of Yahoo! Korea. Next, we manually extracted
terms and phrases from those 50 documents. Finally, we randomly chose 50 of
those terms and phrases and used them as the test queries. These test queries
were given to both the root IR agent ($a_1$ in Fig. 5) of our trained
multi-agent IR system and the traditional meta-search service system to retrieve
information. We evaluated the documents retrieved for each test query and
considered the ones interesting to us as relevant. We used precision, recall and
communication cost as the measures of performance, which are defined as
follows.

$$\text{precision} = \frac{\text{the number of relevant documents retrieved}}{\text{the total number of documents retrieved}}$$

$$\text{recall} = \frac{\text{the number of relevant documents retrieved}}{\text{the total number of relevant documents existing in the seven directories}}$$

$$\text{communication cost} = \text{the total number of query passes to retrieve information}$$
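The first two measures can be computed directly from sets of document identifiers. A minimal sketch, in which the identifiers are made up and returning 0.0 for an empty denominator is a convention chosen here, not one stated in the text:

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


retrieved = ["d1", "d2", "d3", "d4"]   # hypothetical result list
relevant = ["d1", "d2", "d5"]          # hypothetical relevance judgements
p = precision(retrieved, relevant)
r = recall(retrieved, relevant)
```

Here two of the four retrieved documents are relevant and two of the three relevant documents were found, so retrieving fewer irrelevant documents raises precision without affecting recall.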

Results obtained by evaluating the entire set of 50 test queries for various
sizes of the training query set are reported in Fig. 7. In this figure, the
average values of the results from the 50 test queries are used.

Fig. 7. Experimental results



4.1 Precision 

Fig. 7(a) shows the precision curve for our multi-agent IR approach. The
horizontal line across this curve indicates the average precision of the
traditional meta-search service approach. As can be seen, the precision of our
approach improves as each IR agent acquires more knowledge from the user.
In particular, when the size of the training query set is over 150, our approach
is always better than the traditional meta-search service approach in terms of
precision. One might suppose that as the size of the training query set goes
over 200, the precision approaches 100%. However, the precision did not increase
further but converged to 96% in this experiment. The most probable explanation
for this result lies in the tolerable error permitted by the BPN learning
procedure.



4.2 Recall 

Fig. 7(b) shows the recall curve for our multi-agent IR approach. In this graph,
we used the recall ratio relative to the traditional meta-search service
approach, which is defined as follows.

$$\text{recall ratio} = \frac{\text{recall in our approach}}{\text{recall in the traditional meta-search service approach}}$$

Because the traditional meta-search service always retrieves information from
all search tools, the recall of our approach cannot surpass that of the
traditional meta-search service approach. Therefore, the recall ratio is a good
measure for comparison. From this graph, we can see that the recall ratio
also increases with the amount of knowledge of each IR agent. When each IR agent
was trained with more than 200 training pairs, the recall ratio stayed close
to 95%. As a result, when each IR agent was sufficiently trained, the precision
showed a gain of nearly 16% and the recall showed a loss of 5% in comparison
with the traditional meta-search service approach. In many cases, IR tasks
suffer from the too-much problem: for a given query, the IR system in question
often returns too many pieces of information to deal with in a reasonable
manner [?]. This means that precision is a more important measure than recall in
many IR situations. Therefore, an improvement in precision that exceeds the loss
in recall is significant for IR performance.

4.3 Communication Cost 

The communication cost is depicted in Fig. 7(c). The communication cost of the
traditional meta-search service approach is always 7 because it always broadcasts
a given query to all seven directories of Yahoo! Korea. It is interesting to note
that the communication cost obtained by our approach is always considerably
smaller than the one obtained by the traditional meta-search service. Beyond 150
training query pairs, the communication cost no longer changes much. This means
that the communication cost converges earlier than precision and recall. The
communication cost result implies that our approach greatly reduces the load on
network resources and thus considerably mitigates the bottleneck problem under
highly competitive IR situations.

4.4 Training Cost 

The BPN learning procedure used for training each IR agent requires a number
of training cycles over every training pair in order to converge to a stable
state that minimizes the error. In our experimental multi-agent IR system, the
number of training cycles during the training phase of the BPN of each IR agent
did not exceed 56 for any size of the training query set. This means that the
number of feedback propagations that occurred during the training phase of the
BPN was not more than 11,200 (56 × 200) in each IR agent. In practice, each
training phase finished within 39 seconds in the experiment on a UNIX SPARC
station. Therefore, this additional training cost is acceptable considering the
significant improvement in IR performance.

5 Conclusion and Future Work 

In this paper, we proposed a multi-agent learning approach to Web IR using a
neural network. Our multi-agent learning approach provides a method for
locating the information sources that will give useful information and then
retrieving that information. Our approach can also capture knowledge about the
user’s interests or preferences for efficient and effective IR. In the
experimental results, we observed notable performance improvements, at the
expense of some training cost, as compared with the traditional meta-search
service. We also observed that the learning and generalization capability of a
BPN can be effectively utilized as the internal learning mechanism of an IR
agent. Since none of the current generation of IR methods incorporates an
artificial neural network learning mechanism into cooperative multi-agent IR
techniques, the experimental system for our multi-agent IR approach may be an
interesting application of artificial neural networks. The work presented in
this paper is an early stage of our research plan. Currently, we are
investigating how the performance changes as various parameters change in the
experimental system. We are also looking into an automated method for finding
an optimal parameter setting. We plan to continue our research toward a more
general training procedure that is feasible even for multi-agent IR systems
represented as a cyclic directed graph diagram. We will also extend our work to
an adaptive IR agent system that retrieves information from highly dynamic
information sources whose theme, content and structure are subject to
asynchronous changes.
Abstract. Adaptive Information Filtering is concerned with filtering
information streams in changing environments. The changes may occur 
both on the transmission side (the nature of the streams can change) 
and on the reception side (the interests of a user can change). The 
research described in this paper details the progress made in a proto- 
type Adaptive Information Filtering system based on weighted trigram 
analysis and evolutionary computation. The main improvements of the 
algorithms employed by the system concern the computation of the dis- 
tance between weighted trigram vectors and a further analysis of the 
two-pool evolutionary algorithm. We tested our new prototype system 
on the Reuters-21578 text categorization test collection. 



1 Introduction 

We live in what is often termed the “information age”. It might more
appropriately be called the “data age”, for only relevant data is information, and finding
relevant data among the ever greater accumulations of available data is becom- 
ing increasingly more difficult. One of the fields dealing with this problem is 
Information Filtering (IF). IF is the process of filtering data streams in such a 
way that only particular data are preserved, depending on certain information 
needs. The IF environment is the combination of data stream and information 
needs. When the data stream and the information needs are not changing over 
time the IF environment is said to be static. When, however, the IF environment 
is dynamic, as opposed to static, an adaptive information filtering (AIF) system 
is called for. An AIF system is an IF system capable of adapting to changes in 
both the data stream and the information needs. 

One of the essential ingredients in any information retrieval (IR) or IF system 
is its ability to match a query (in the case of an IR system) or a profile (in the 
case of an IF system) with the documents available for perusal. While optimally 
a semantical match should be performed, that is not currently feasible and we 
have to be satisfied with a syntactical match. A good general reference to the 
field of IR/IF is P]. 

The most widely employed syntactical representation of textual documents
is based on term indexing (see for example jSl). In manual indexing keywords are
manually assigned to a document, while in automatic indexing the frequencies
of all the terms occurring in a document are indexed. Term indexing has several
drawbacks, such as its sensitivity to spelling variations and errors, its static



D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 513-524, 1999.
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nature (the terms need to be known beforehand which is fine for IR but not for 
IF) and its reliance on linguistic preprocessing, such as stop word removal and 
word stemming, to make it effective. 

Another approach, which has received considerable attention in the last decade,
is based on so-called n-gram analysis. The n stands for a positive
integer. Applying n-gram analysis produces an n-gram frequency vector,
which holds the frequencies of all the distinct character combinations of length n.
In 1-gram analysis the occurrence of single letters is determined, in 2-gram
analysis that of pairs of letters, in 3-gram analysis that of triplets, etc. For a
specific value of n, especially a low one, the Latin name is often
used instead of the numeric value, so 2-grams are often called bi-grams or
bigrams, 3-grams trigrams and 4-grams quadgrams, but 7-grams are usually just 7-grams.
For example, the word “coconut” contains the bigrams “co”, “oc”, “on”, “nu”
and “ut”, all with a frequency of one except for “co”, which has a frequency of
two. The trigrams are “coc”, “oco”, “con”, “onu” and “nut”, all with a frequency
of one. N-gram analysis has many advantages over term-based systems:
it is more robust when dealing with spelling variations or errors,
and it does not require linguistic preprocessing, which facilitates the deployment of
n-gram-based systems in multi-topic/multi-language environments [2]. However,
an n-gram-based system can still benefit from preprocessing:
for example, when the stop word ‘the’ is removed, the trigram ‘the’ becomes
significant.
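
The counting described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function name `ngram_frequencies` is hypothetical.

```python
from collections import Counter


def ngram_frequencies(text, n):
    """Count every run of n consecutive characters in text."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Slide a window of width n over the string, one character at a time.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


bigrams = ngram_frequencies("coconut", 2)
trigrams = ngram_frequencies("coconut", 3)
```

For “coconut” this reproduces the counts given in the text: “co” occurs twice, and every other bigram and every trigram occurs once.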

In PI it was shown that term indexing — traditionally used in IR/IF systems 
— is in general not suited for AIF, but that weighted trigram analysis is. See 
P] for an example of a term-based AIF system for use in a restricted domain. 
A prototype AIF system based on weighted trigram analysis was introduced in 
0 and PI1|. For n < 3 n-gram analysis does not provide sufficient syntactical 
information (Z) and for n > 3 advanced sparse vector representations are required 
which will be employed in future versions of our AIF system. 

The matching technique used in the original prototype AIF system was based 
on the Euclidean metric, which is a special case of the Minkowski $L_p$-metric,
namely for p equal to two (p equal to one is called the Manhattan metric). This
paper details the advances made in the matching technique. An important im- 
provement is normalizing the weighted trigram vectors instead of the trigram 
vectors themselves. It also introduces the Manhattan metric as a possible al- 
ternative to the Euclidean metric in the prototype AIF system. For a general 
introduction to measurements in information science see 0. 

A crucial step in working with weighted trigram analysis is to find the right 
weight vector. Our first prototype AIF system introduced a novel two-pool evo- 
lutionary algorithm (EA) for optimizing weight vectors. EAs are a class of opti- 
mization algorithms which come in handy when no a-priori solutions to a specific 
optimization problem are available. They work by evolving a population of trial 
solutions using techniques inspired by evolutionary biology. For an easy intro- 
duction to evolutionary computation (EC) see chapter 4 of [Hj; for a more com- 
prehensive introduction to EC see 0. This paper provides a full derivation of the 
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two-pool EA, showing that it is a special case of a whole family of classification 
EAs. 

A new prototype AIF system based on the improved matching technique has 
been constructed. This paper describes the new system and presents the results 
of testing it on the Reuters-21578 text categorization test collection. Using a 
standard test collection will facilitate comparing these results with other case 
studies. The Reuters collection has embedded tags indicating common usage in 
text categorization tests. They were not suitable for our purposes which prevents 
our results from being compared to previous studies which did employ those 
tags. However, as the collection is readily available and later in this paper we 
describe how we obtained the training and test sets for our research, we facilitate 
conducting studies which can be compared to our results. 

The paper is structured as follows. In section 2 we give a global description of
the complete system. In section 3 we describe the distance measures. The details
of the two-pool EA are presented in section 4. In section 5 our new prototype AIF
system is explained, while section 6 describes the Reuters-21578 test collection
and the results of our experiments with that collection. Finally, section 7 gives
our conclusions.



2 Overview of the AIF System 

This section is meant to illustrate the working of the system as a whole without 
drowning the reader in all the details which are given later in this paper. The core 
of the system is the clustering cycle (see Fig. 1). The clustering algorithm uses
a weight vector to compare the trigram frequency vector of a document with the 
prototype vectors of the clusters and decides in what cluster the document will be 
classified. Depending on the parameters of the cluster algorithm, the prototype 
vector of the chosen cluster will shift a bit in the direction of the newly presented 
document vector. The prototype vectors are initialized by averaging the trigram 
vectors of a number of documents belonging to each cluster (class). 
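
The two prototype operations mentioned above, initialization by averaging and shifting toward a new document, might look as follows. The text only says the prototype "will shift a bit in the direction" of the document, so the linear interpolation used here is an assumption, as are the function names.

```python
def initial_prototype(document_vectors):
    """Average the trigram vectors of documents known to belong to one class."""
    count = len(document_vectors)
    return [sum(component) / count for component in zip(*document_vectors)]


def shift_prototype(prototype, document, shift_factor):
    """Move a prototype a fraction of the way toward a document vector.

    Assumption: a linear interpolation, where shift_factor = 0 leaves the
    prototype unchanged and shift_factor = 1 replaces it by the document.
    """
    return [p + shift_factor * (d - p) for p, d in zip(prototype, document)]
```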

The weight vector and the parameters of the cluster algorithm (the cluster 
radius and the shift factor) are determined by the EA. So the EA works on a 
population of individuals, each containing a chromosome whose genes consist of
the components of the weight vector and the parameters of the cluster algorithm. 
The fitness of an individual is determined by dividing the number of documents 
it has correctly classified by the total number of documents it has classified. 



3 Measuring Distance in Weighted Trigram Frequency 
Vector Space 

The performance of a matching technique is called its discriminating power. The 
higher the discriminating power, the better a technique is able to separate docu- 
ments which are semantically dissimilar and to group together documents which 
are semantically similar. In |2j it was shown that the combination of weighted 
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trigram 

Fig. 1. Schematic overview of the adaptive IF system.



trigram analysis (each trigram is assigned a weight indicating its relative impor- 
tance) and the Euclidean distance metric has sufficient discriminating power for 
document classification. 

The size of the alphabet used will be indicated with $|a|$. Consider two document
vectors $d_1$ and $d_2$. Let $f = (f_1, f_2, \ldots, f_n)$ and $g = (g_1, g_2, \ldots, g_n)$ with
$n = |a|^3$ be the corresponding trigram frequency vectors for these documents.
Let $w = (w_1, w_2, \ldots, w_n)$ with $w_i > 0$, $i = 1, \ldots, n$ be the weight vector giving
the relative importance of the different trigram frequencies. The weighted trigram
vectors $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ corresponding to $f$ and $g$
respectively are defined as follows: $x_i = f_i w_i$ and $y_i = g_i w_i$ for $i = 1, \ldots, n$.

In PI it was argued that the trigram frequency vectors had to be normalized
to prevent the length of a document influencing the distance metric. However,
if we want to measure the distance between two weighted trigram frequency
vectors then those are the vectors that need to be normalized, not the trigram
frequency vectors. This can be accomplished by introducing $\bar{x} = (\bar{x}_1, \ldots, \bar{x}_n)$
and $\bar{y} = (\bar{y}_1, \ldots, \bar{y}_n)$ with $\bar{x}_i = x_i / \sum_{j=1}^{n} x_j$ and $\bar{y}_i = y_i / \sum_{j=1}^{n} y_j$.

The match between $d_1$ and $d_2$ can then be estimated by applying the Euclidean
distance metric to $\bar{x}$ and $\bar{y}$:

$$\rho(\bar{x}, \bar{y}) = \sqrt{\sum_{i=1}^{n} (\bar{x}_i - \bar{y}_i)^2} \qquad (1)$$

An alternative to the Euclidean metric is the Manhattan metric. Using it the
match between $d_1$ and $d_2$ can be estimated as follows:

$$\rho'(\bar{x}, \bar{y}) = \sum_{i=1}^{n} |\bar{x}_i - \bar{y}_i| \qquad (2)$$
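
As an illustration of equations (1) and (2), the sketch below weights two frequency vectors, normalizes the weighted vectors so their components sum to one, and then measures both distances. It uses plain lists instead of the sparse vectors a real system would need, and all function names are hypothetical.

```python
import math


def normalize_weighted(frequencies, weights):
    """Weight a trigram frequency vector, then scale it to sum to one."""
    weighted = [f * w for f, w in zip(frequencies, weights)]
    total = sum(weighted)
    if total == 0:
        raise ValueError("weighted vector has no mass to normalize")
    return [value / total for value in weighted]


def euclidean(x, y):
    """Equation (1): Euclidean distance between two normalized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Equation (2): Manhattan distance between two normalized vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))
```

Because normalization happens after weighting, a document and a copy of it that is twice as long end up at distance zero, which is the length independence the normalization is meant to provide.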



4 Applying Evolutionary Computation to Classification 

In our AIF system the classification of a document vector is dependent on the 
weight vector being used. We determine this vector by using an evolutionary 
algorithm (EA). In this section we will consider the development of classification 
EAs (CEAs) more generally, but for our concrete system the members of a 
population are weight vectors, the score of a member is the number of correctly 
classified documents and its age is the total number of documents it has classified. 

The set of objects to classify will be denoted with $S$ and the number of objects
in $S$ with $|S|$. For short, $\sigma$ will stand for an object and $c(\sigma)$ for the class $\sigma$ maps
to. The set $P = \{P_1, P_2, \ldots, P_{pop\_size}\}$ is the population of trial solutions, with
$pop\_size$ a positive integer. For the purpose of indexing the population members
we define $i$ as an integer between 1 and $pop\_size$. Two essential components of
any CEA are the evaluation of all the population members and, based on that,
the evolvement of the population. The evolvement component will be denoted
with EVOLVE($P$). The evaluation component will be denoted with EVAL($S$, $P$)
and is defined as follows:



$$\mathrm{EVAL}(S, P): \forall P_i \in P \text{ determine } \mathrm{FITNESS}(S, P_i) \qquad (3)$$



The fitness of a trial solution given an object set is the average score of that 
trial solution on classifying all the objects in the object set. The range of the 
fitness is from zero to one with zero being the worst (all classifications incorrect) 
and one the best (all classifications correct). The fitness function is defined as 
follows: 

$$\mathrm{FITNESS}(S, P_i) = \frac{\sum_{\sigma \in S} \mathrm{RESULT}(\sigma, P_i)}{|S|} \qquad (4)$$

The result of classifying an object given a trial solution is either zero (incor- 
rect) or one (correct). The result function is defined as follows: 



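$$\mathrm{RESULT}(\sigma, P_i) = \begin{cases} 0 & \text{if } \mathrm{CLASSIFY}(\sigma, P_i) \neq c(\sigma) \\ 1 & \text{if } \mathrm{CLASSIFY}(\sigma, P_i) = c(\sigma) \end{cases} \qquad (5)$$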



The result function works by comparing the actual mapping of an object to 
the mapping of that object computed using a trial solution. The function which 
performs that computation is defined as: 



$$\mathrm{CLASSIFY}(\sigma, P_i) = \text{the class } \sigma \text{ maps to using } P_i \qquad (6)$$



The CEA can then be defined as given in Algorithm 1.
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Algorithm 1 Static object set 

initialize S, P
EVAL(S, P)
while (not termination condition) do
    EVOLVE(P)
    EVAL(S, P)
end
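
The evaluation side of Algorithm 1, that is equations (3) to (6), can be sketched as below. The `classify` callable stands in for CLASSIFY and is supplied by the caller; the names and the representation of the object set as (object, class) pairs are assumptions made for the example.

```python
def result(obj, true_class, solution, classify):
    """Equation (5): 1 for a correct classification, 0 otherwise."""
    return 1 if classify(obj, solution) == true_class else 0


def fitness(labelled_objects, solution, classify):
    """Equation (4): fraction of the object set classified correctly."""
    correct = sum(result(obj, cls, solution, classify)
                  for obj, cls in labelled_objects)
    return correct / len(labelled_objects)


def evaluate(labelled_objects, population, classify):
    """Equation (3): determine the fitness of every trial solution."""
    return [fitness(labelled_objects, member, classify) for member in population]
```

As a toy usage, a trial solution can be a single threshold and `classify` a function that returns class 1 when the object is at least that threshold; the fitness is then the fraction of labelled numbers that fall on the correct side.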



4.1 Expanding Object Set 

If S expands in time we can simply execute Algorithm 1 after each expansion to
find a mapping from object space to class space at any given time. If the set of 
objects is smaller than the object space and represents it better as it expands, 
then the mapping found by the CEA will better approximate the mapping from 
object space to class space as time progresses. In this case it is likely that the 
mapping found at any particular time is a good approximation of the mapping 
to be found the following time and therefore would make a good starting point 
for the next search. Time will be denoted with $\tau$ and the object added to $S$ at
time $\tau$ with $\sigma_\tau$. The new algorithm is given as Algorithm 2.



Algorithm 2 Expanding object set

τ ← 1
initialize S, P
repeat forever
    EVAL(S, P)
    while (not termination condition) do
        EVOLVE(P)
        EVAL(S, P)
    end
    τ ← τ + 1
    add σ_τ to S
end



4.2 Shifting Window 

There are a number of reasons why we may not want to use an ever expanding 
set of objects to find a mapping from object space to class space. For one, this 
requires an ever increasing amount of computational resources, both in terms 
of memory and in CPU cycles. And secondly, the mapping may change over
time so that obtaining $c(\sigma)$’s might prove to be an expensive operation or it
is even possible that old $c(\sigma)$’s are not obtainable at all. In this case we can
impose a shifting window on $S$ limiting the number of objects to be used in the
evolutionary process at any given time. The size of the shifting window will be
indicated with $w$. The new algorithm is given in Algorithm 3.
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Algorithm 3 Shifting window 
τ ← 1
initialize S, P
repeat forever
    EVAL(S, P)
    while (not termination condition) do
        EVOLVE(P)
        EVAL(S, P)
    end
    τ ← τ + 1
    add σ_τ to S
    if (τ > w) then remove σ_{τ-w} from S
end
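
The window bookkeeping of Algorithm 3 (add the newest object, drop the one that is w steps old) is what a bounded double-ended queue does. A minimal sketch, with a hypothetical generator name and the evolutionary steps left out:

```python
from collections import deque


def shifting_window(stream, w):
    """Yield the object set S after each new object arrives.

    A deque with maxlen=w discards its oldest element automatically once
    more than w objects have been appended.
    """
    window = deque(maxlen=w)
    for obj in stream:
        window.append(obj)
        yield list(window)
```

With w = 2 and the stream 1, 2, 3, 4, the successive object sets are [1], [1, 2], [2, 3] and [3, 4].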



4.3 Age 

One thing we lose by employing a shifting window is the information on how well
trial solutions performed on objects no longer contained in $S$. And the smaller
$w$ is, the greater this loss. To preserve this information in our shifting window
CEA we introduce the concepts of member age and member score. The age of a
member is defined as the number of population generations since the creation of
that member and is denoted with $P_i^{age}$. The score of a member is defined as the
number of correct classifications it has made since its creation and is denoted
with $P_i^{score}$. The fitness function is now defined as:

$$\mathrm{FITNESS}(P_i) = \frac{P_i^{score}}{P_i^{age}} \qquad (7)$$

And the evaluation component becomes:

$$\mathrm{EVAL}(S, P): \forall P_i \in P: \forall \sigma \in S: \quad P_i^{age} \leftarrow P_i^{age} + 1, \quad P_i^{score} \leftarrow P_i^{score} + \mathrm{RESULT}(\sigma, P_i)$$

and compute FITNESS($P_i$)
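
A small sketch of the age and score bookkeeping behind equation (7). The class and method names are hypothetical, and returning 0.0 for a member that has not classified anything yet is an assumption, since the ratio is undefined at age zero.

```python
from dataclasses import dataclass


@dataclass
class Member:
    score: int = 0  # correct classifications since creation
    age: int = 0    # classifications attempted since creation

    def record(self, correct):
        """Update age and score after classifying one object."""
        self.age += 1
        self.score += 1 if correct else 0

    @property
    def fitness(self):
        """Equation (7); defined here as 0.0 for a newborn member."""
        return self.score / self.age if self.age else 0.0
```

A member that classifies three out of four objects correctly has fitness 0.75, regardless of whether those objects are still inside the shifting window.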

4.4 Two Pool 

One of the consequences of the new way of determining fitness is that as the age 
of a member increases so does its statistical reliability in approximating the true 
fitness of a member, that is, its fitness if computed using S equal to the entire 
object space. If, when producing offspring, the new member’s score and age are 
set to zero, as opposed to basing them on those of its parent(s), its statistical 
reliability plunges and time is needed to recover some measure of reliability. In 
that case it is necessary to prevent the new member from participating in the 
evolution process until it matures. This can be accomplished by splitting the 
population into two pools, namely a child pool $P^c$ and an adult pool $P^a$ with
$P = P^c \cup P^a$, $|P^c|$ the number of members in $P^c$, $|P^a|$ the number of members
in $P^a$ and age_threshold the age at which members are moved from $P^c$ to $P^a$.
The resulting algorithm is given in Algorithm 4.
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Algorithm 4 Two pool 

τ ← 1
initialize S, P^c
repeat forever
    EVAL(S, P)
    while (not termination condition) do
        if (|P^a| > 0) EVOLVE(P^a)
        EVAL(S, P)
        ∀ P_i ∈ P^c: if (P_i^age = age_threshold) move P_i from P^c to P^a
    end
    τ ← τ + 1
    add σ_τ to S
    if (τ > w) then remove σ_{τ-w} from S
end
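
The promotion step that distinguishes the two-pool algorithm can be sketched as follows. Members are represented as plain dictionaries with an "age" entry, which is an assumption made for the example; the comparison uses >= rather than the equality test of the algorithm as a defensive choice, so that a member can never be skipped.

```python
def promote_mature(child_pool, adult_pool, age_threshold):
    """Move every child that has reached the age threshold to the adult pool.

    Both pools are lists that are modified in place.
    """
    still_children = []
    for member in child_pool:
        if member["age"] >= age_threshold:
            adult_pool.append(member)
        else:
            still_children.append(member)
    child_pool[:] = still_children
```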



5 A New Adaptive Information Filtering System 
Prototype 

The prototype AIF system introduced in [2] was completely rewritten, incorporating
the new distance measures presented in section 3 and using the two-pool
CEA derived in section 4. Another change is that the weights are expressed in
floating point numbers instead of integers, allowing much more gradual change 
during mutation. A significant improvement has been made in how the system 
measures its performance; in addition to tracking the lowest, average and high- 
est fitness values, the new system also measures the actual system performance. 
System performance is expressed in correct classifications per document, ranging 
from zero for all documents classified incorrectly, to one for a perfect classifica- 
tion record. While the fitness values offer insight into how the CEA is doing and 
can, to a certain degree, be indicative of how the system is performing, system 
performance is by far the best basis for comparisons. 

In order to accurately measure the performance of the system thousands
of documents need to be classified. The $c(\sigma)$’s should be provided via user
feedback. Until the system is ready for trial deployment, however, it will be
necessary to simulate this user feedback. One way this can be accomplished is
by employing a test set of documents for which the $c(\sigma)$’s are known. The CEA
is a special case of Algorithm 4, namely with shifting window size set to one and
with a termination condition such that the inner loop is executed only once for
each outer loop. The population members each consist of their score, their age,
the radius parameter used by one of the CLASSIFY functions and a full set of
weights. The system can then be described as given in Algorithm 5.

The prototype vectors representing the category cluster centers are initialized 
by calculating for each the average of a certain number of trigram vectors. The 
initialization of the population is done by setting the scores and ages to zero, 
the radius to a random value within a user specified range and assigning positive 
random values to the weights. 
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Algorithm 5 AIF two pool 

τ ← 1, initialize prototype vectors
initialize P^c
repeat forever
    EVAL(σ_τ, P)
    if (|P^a| > 0) EVOLVE(P^a)
    ∀ P_i ∈ P^c: if (P_i^age = age_threshold) move P_i from P^c to P^a
    τ ← τ + 1
end



There are two CLASSIFY($\sigma$, $P_i$) functions. The first determines if the distance
between $\sigma$ and the closest class to $\sigma$ is within the maximum class radius as
set in the parameter file. If so, it returns the index of that class; if not, it returns
a value indicating no class was close enough. The other simply determines the
class closest to $\sigma$. The distance functions used are the Manhattan distance function
$\rho'(x, y)$ and the Euclidean distance function $\rho(x, y)$ as derived in section 3.
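
The two CLASSIFY variants can be sketched as below. The distance function is passed in by the caller, `None` is used as the "no class was close enough" value, and the function names are hypothetical; none of these details are fixed by the text.

```python
def classify_closest(document, prototypes, distance):
    """Return the index of the class prototype closest to the document."""
    return min(range(len(prototypes)),
               key=lambda k: distance(document, prototypes[k]))


def classify_within_radius(document, prototypes, distance, radius):
    """Return the closest class index, or None if it lies outside the radius."""
    k = classify_closest(document, prototypes, distance)
    return k if distance(document, prototypes[k]) <= radius else None
```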

There are two evolvement algorithms, one with crossover (resulting in two
children produced by two selected parents) and one without crossover (resulting
in one child which is a copy of the selected parent). In both algorithms the
generated child(ren) are mutated (see below) and the weakest adult(s) is (are)
removed to make room for the generated child(ren). The form of crossover employed is uniform
crossover, in which each gene of a child has an equal chance to come from either 
parent. Mutation is performed by adding with a certain probability Gaussian 
noise to the genes of a member. Parent selection is done by selecting fitter mem- 
bers with an exponentially higher probability; this causes selective pressure. If 
no adult gets selected by this process, the fittest adult is selected by default. 
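
Uniform crossover and Gaussian mutation as described above might be sketched as follows. Chromosomes are plain lists of numbers; making the second child the gene-wise complement of the first is an assumption, as are the function and parameter names. The selection scheme is left out because its exact form is not given.

```python
import random


def uniform_crossover(parent_a, parent_b, rng):
    """Build two children; each gene comes from either parent with equal chance.

    Assumption: the second child receives the gene the first child did not.
    """
    child_1, child_2 = [], []
    for gene_a, gene_b in zip(parent_a, parent_b):
        if rng.random() < 0.5:
            child_1.append(gene_a)
            child_2.append(gene_b)
        else:
            child_1.append(gene_b)
            child_2.append(gene_a)
    return child_1, child_2


def mutate(genes, chance, sigma, rng):
    """Add Gaussian noise (standard deviation sigma) to each gene with the given chance."""
    return [g + rng.gauss(0.0, sigma) if rng.random() < chance else g
            for g in genes]
```

Passing an explicit `random.Random` instance, for example `random.Random(0)`, keeps runs reproducible.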

The user definable parameters for the new AIF system are as follows. For the 
CEA the user can specify the size of the population (positive integer), the age 
threshold (positive integer), the number of adults to replace after each evalua- 
tion (positive integer), the selective pressure rate (real value between 0 and 1), 
crossover (enabled/disabled), the chance that a gene gets mutated (real value 
between 0 and 1) and the amount of Gaussian noise used during mutation (real 
value between 0 and 1). Note that after two times the age threshold genera- 
tions, the size of the child pool is the age threshold times the number of adults 
to replace after each evaluation, assuming the total population size is larger or 
equal. So, for example, if the size of the population is 100, the age threshold 
10 and the number of adults to replace after each evaluation is 4, then after 
20 generations the child pool will stabilize at size 40 and the adult pool at size 
60. For the clustering algorithm the user can specify the distance function to be 
used (Manhattan or Euclidean), the number of vectors used for averaging during 
the initialization of the prototype vectors (positive integer) and the range of the 
radius values (positive real values). For each experiment the user can further 
specify the number of clusters and the size and number of passes for the training 
and the test set. 
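
The pool sizes quoted in the example above can be checked with a small simulation. It assumes, following Algorithm 5, that each generation first ages every child by one, then replaces the given number of adults by newborn children (if any adults exist), and finally promotes the children that reached the age threshold. The function name is hypothetical.

```python
def simulate_pool_sizes(pop_size, age_threshold, n_replace, generations):
    """Return (child pool size, adult pool size) after each generation."""
    child_ages = [0] * pop_size  # every member starts as a newborn child
    n_adults = 0
    history = []
    for _ in range(generations):
        # EVAL: one document per generation ages each child by one.
        child_ages = [age + 1 for age in child_ages]
        # EVOLVE: the weakest adults make room for newborn children.
        if n_adults > 0:
            replaced = min(n_replace, n_adults)
            n_adults -= replaced
            child_ages.extend([0] * replaced)
        # Promotion: children at the threshold move to the adult pool.
        n_adults += sum(1 for age in child_ages if age >= age_threshold)
        child_ages = [age for age in child_ages if age < age_threshold]
        history.append((len(child_ages), n_adults))
    return history
```

With a population of 100, an age threshold of 10 and 4 replacements per generation, the pools reach 40 children and 60 adults after generation 20 and stay there.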
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6 The Reuters-21578 Text Categorization Test Collection 

The experiments conducted with the first prototype of the AIF system used In- 
ternet newsgroup articles from a number of carefully selected moderated news- 
groups. This is not satisfactory for two reasons. First, while the moderation 
process tends to eliminate most of the personal messages, it allows a lot of meta- 
messages, such as announcements, the topics are often interpreted very broadly 
and the article contents can be relevant to multiple topics. And secondly, unless 
one carefully archives, indexes and makes available the articles used in an experiment,
it is not possible for other researchers to reproduce reported experimental
results. A collection of documents without the above mentioned drawbacks was 
desired to facilitate experimentation with the new AIF system. The construc- 
tion of a large high-grade text categorization test collection is extremely time 
consuming, therefore we decided to use a standardized collection instead of cre- 
ating one of our own. The collection we selected was the Reuters-21578 text 
categorization collection. 

The documents in the Reuters-21578 collection appeared on the Reuters 
newswire in 1987. The collection is downloadable from David D. Lewis’ professional
home page (see footnote). The documents in the Reuters-21578 collection are in SGML
format and tagged for the purpose of splitting into training and test sets as used 
in published studies concerning text classification. This was done to allow the 
results of different studies to be compared. For our purposes, however, a subset 
of the collection was needed. First of all it was required that a document be 
indexed with only one topic, which limited the subset to 9494 documents. And, 
secondly, it was required that the document be a regular text document which 
further limited the subset to 8654 documents. From that subset only those documents
belonging to the ten most frequent topics in the subset, as listed in Table
1, were employed. For the purpose of trigram analysis, a document is treated as
a string of characters. Letters are handled case-insensitively and all other characters
are interpreted as the space character. Any sequence of spaces is replaced
by a single space. Thus the trigram alphabet consists of 27 characters, namely
‘a’ through ‘z’ and the space delimiter. The number of distinct trigrams is then
$27^3 = 19683$.
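
The character handling described above can be sketched with two regular-expression substitutions. The function name is hypothetical, and whether a leading or trailing space is kept is not specified in the text; this sketch keeps it.

```python
import re
import string

# 'a' through 'z' plus the space delimiter: 27 characters.
ALPHABET = string.ascii_lowercase + " "
NUM_TRIGRAMS = len(ALPHABET) ** 3


def normalize(text):
    """Reduce a document to the 27-character trigram alphabet."""
    lowered = text.lower()                    # letters are case-insensitive
    spaced = re.sub(r"[^a-z]", " ", lowered)  # everything else becomes a space
    return re.sub(r" +", " ", spaced)         # collapse runs of spaces
```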

We did experiments using a growing number of the selected topics in Table
1 from the Reuters-21578 collection. Our results are given in Table 2. The experiments
used the Manhattan metric as distance measure. It was decided to
classify in the closest cluster regardless of the distance to that cluster. We averaged 30
document vectors in order to properly initialise the prototype vectors. For each
experiment the training set comprised thirty document vectors for each
topic and the test set fifty document vectors for each topic (except for Money-Supply,
for which the sample was slightly smaller). The population size was 200, the age
threshold 25, the number of adults which got replaced each generation was 2,
the selective pressure was 0.1, crossover was enabled, the mutation chance was
0.5, the mutation rate was 0.00001 and the training set was presented 20 times.

^1 currently at http://www.research.att.com/home/lewis



Adaptive Information Filtering Algorithms 523 



Table 1. Subset of Reuters-21578 used in experiments

tag           topic                            size
acq           Mergers/Acquisitions             2125
coffee        Coffee                            114
crude         Crude Oil                         355
earn          Earnings and Earnings Forecasts  3735
interest      Interest Rates                    211
money-fx      Money/Foreign Exchange            259
money-supply  Money Supply                       97
ship          Shipping                          156
sugar         Sugar                             135
trade         Trade                             333



Table 2. Test set results (percentage correctly classified)

Topics           Unweighted  Average  Best   System
Coffee, trade    99.0        99.5     100    100
+ crude          93.3        98.6     100    98.7
+ money-fx       89.5        96.6     98.1   96.5
+ sugar          89.2        97.0     100    95.6
+ money-supply   83.1        93.9     100    89.7
+ ship           78.5        89.2     96.3   85.9
+ interest       77.2        88.2     93.7   84.9



The first column of Table 2 lists the test set results for classifying without
the use of weights. The second column lists the average adult population
member score, the third column the best adult population member score and 
the fourth column the system score. The results show that the new matching 
technique presented in this paper allows even unweighted trigram analysis to 
perform reasonably well for a small number of topics. When the number of top- 
ics increases the superiority of weighted trigram analysis is clearly demonstrated 
by the system scores. Preliminary results indicate that when progressively more 
training time is allocated as the number of topics increases, the test set results 
for weighted trigram analysis are greatly improved. 

7 Conclusions 

In this paper we described a complete revision of the prototype AIF system 
introduced in |S| and mg. From the results presented in section 6 we can draw
a number of conclusions. First of all, the discriminating power has been significantly
increased as a result of the new matching technique presented in section 3.
Secondly, the combination of the new matching technique and the AIF two-pool 
CEA delivers greatly improved system performance. As a result of the improved 
system performance it is now feasible to experiment with eight and more clusters 
instead of only four clusters (more than four clusters caused strong degradation 
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of performance in the old system). But while the case for generalization and 
scalability has been further strengthened, there is still a lot of work to be done 
to prove it conclusively. 

Obviously a lot more experimental data is needed. A major hurdle has been 
the amount of computational time required to perform an experiment, as well 
as huge long term storage and RAM requirements. The recent move in long 
term storage from huge sparse trigram frequency vectors to compact trigram 
frequency vectors resulted in a reduction in the amount of storage space required 
of between 90 and 95 percent. We are now looking into doing the same for the 
internal representation of the trigram frequency vectors and possibly the weight 
vectors too, which should reduce RAM requirements comparably. It should also 
reduce the amount of computational time significantly allowing much larger 
experiments. Another area we have to concentrate on is the fine tuning of the two- 
pool EA. Other potential improvements to our AIF system we will investigate 
are support for n-grams with user definable values of n and larger alphabets. 
Further in the future we will be looking at more advanced clustering algorithms 
which will be able to add new clusters and in which each cluster would have an 
independent radius. 
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Abstract. We present in this paper a video representation and retrieval 
model using conceptual graphs and based on the characteristics of the
perception of video content in time. We propose a new semantic structure
for video content representation based on the notion of event. This new
view of content representation takes into account the different levels of
abstraction of video content. The model also covers the
temporal dimension of video and so presents a temporal model based on
two time dimensions, defined as video time and story time.



1 Introduction 

Video is a medium which contains an enormous variety of information. The temporal
dimension of video documents, due to their particular image presentation
frequency, creates the illusion of animation and so makes the perceivable content
of video documents far richer than that of still images and
text. Thus the modeling of video content information needs comprehensive
attention, in order to capture as much of this information as possible
and to present a suitable schema permitting its efficient use.
The present study aims to provide a video model to be used in
a video information retrieval system. In such systems, besides the information
concerning the general characteristics of a document (such as title, author, etc.), we
focus on the content information of the documents. In order to define such
information, we have based our work on the study of the characteristics of the
perception of video content in time. This study revealed some characteristics
of video content which we afterwards applied as the principles of our model.
The introduction of the notion of event as the unit of video content description
and the consideration of two dimensions of time, video time and story time,
are among these principles. As we will explain later, we believe conceptual
graphs to be an adequate formalism to describe the proposed video data model and
to support the logical basis of the retrieval procedure.

D.J. Hand, J.N. Kok, M.R. Berthold (Eds.): IDA’99, LNCS 1642, pp. 525-|^2SI 1999- 
© Springer-Verlag Berlin Heidelberg 1999
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One of our important viewpoints in the conception of the present model has 
been to provide a general model: a model that is easily adaptable to different types
of video and application contexts. To fulfill this objective we propose the model
in two distinct parts concerning the generic and specific aspects of videos. The
flexible construction of the event unit has particularly facilitated the achievement
of this generality.

In the following, we first give in section 2 a concise review of the existing 
approaches of video modeling. In section 3 we describe the features of the video 
content perception. These features lead to the principles of our model described 
in section 4. Section 5 provides an introduction to the conceptual graphs formal- 
ism and their relation to information retrieval. We present the description of the 
video model using conceptual graphs in section 6. An example showing the
interest of our approach is given in section 7. Finally we conclude in section 8. 



2 Related Works 

In order to give a concise overview of the principal approaches adopted in existing
video models and to note their inadequacy in serving as a general representation
and retrieval model, we classify them into three main categories. The first
category, and the best known, consists of the models based on the hierarchical
cinematographical structure of video documents Q, d, PI, which rely on the
cutting of video into scenes, shots and images. The use of this structure
is adequate for video documents whose semantic structure follows the so-called
cinematographical structure, for example television news. For other
types of video documents, where such a structure does not exist, these models
are restrictive. A second category consists of the models based on the
representation of objects P] and of spatio-temporal relationships between
objects [5]. These models are restricted to object description and do not permit
a more elaborate semantic description of video content. Among the existing
approaches to video modeling, stratification jSj and |Z] seems more oriented to
the semantic description of video. However, as stratification was initially
proposed for video editing and annotation systems, it does not provide a precise
definition of what the content of each stratum may be, how this content
description is organized and how it is possible to retrieve a video document by
describing its content through a set of strata. We will see in this paper how
the semantic structure we define for video documents overcomes the above
restrictions. To understand the foundations of our proposed model, the following
section describes the characteristics of video content perception which we have
taken into consideration.

3 Video Content Perception 

A video document is a sequence of images played at a suitable frequency to
create the illusion of animation. All information we receive from this medium,
either temporal or non-temporal, is the result of this succession of images in
time. To show the different aspects of the time-based perception of video, in the
following, we first explain the different levels of perception of video content and
then the different time dimensions related to these levels.

3.1 Different Levels of Perception of Video Content 

We first define two principal levels of perception of video content:
visual perception and semantic perception. Visual perception corresponds to the
elementary information received when watching a video, independent of any
abstraction attached to it. This information consists of the presence of
pictorial elements¹ and the change of the spatio-temporal characteristics of
these elements. An example of visual perception is the perception of the
presence of a circle that goes up and then comes down in a sequence of images.
This level of perception is independent of the different video and application
contexts.

Once pictorial elements are observed, abstractions are made from these
elements towards different concepts². Such abstractions depend mainly on the
context in which the video is presented. For example, the circle of the previous
example can be perceived in a particular context as the sun, and its going up and
coming down as the sunrise and the sunset. This abstraction process may continue
in a hierarchical way: new concepts can be created as abstractions of one or
several other concepts. Thus, in the previous example, the sunrise followed by
the sunset may be considered as a "day", and so on.

The abstraction of the content information explained above does not end
with the construction of concepts (and relations). The understanding of the video
content leads finally to the perception of a story. In fact, the story is first created
by the maker of the video document by means of the techniques and
the art of cinematography (the cutting of video into shots and scenes is part of
this art). Then, the comprehension of the cinematographical language by the
spectator reshapes the story. The notion of story is of fundamental importance
in the semantic content information perceived in a video, as it represents the
conceptual description of the video content for the spectator.

Each story contains a set of events, which are the dominant facts happening
during the story. An event is formed by a set of concepts and relations abstracted
at the semantic level. According to the context of the video, an event may
represent the presence of certain objects or persons, an action performed by or
on some objects or persons, etc. The events have temporal continuity in the
story, and so a description of the whole story may be formed by describing the
events in time. Here we note that whereas the concepts and relations forming the
interior structure of events are context dependent, the construction of the
whole semantic structure of video by events related in time is independent of
video and application contexts and constitutes a general semantic structure for
video documents.

¹ A pictorial element is defined as a set of pixels that satisfy common criteria in a
sequence of images. Such criteria may be the form or texture created by the set of
pixels, their color, etc.

² For the sake of simplicity, we use the term concept for the result of an abstraction
process. We will see later that in fact there exist concepts and also relations between
concepts.

In the next section, we specify the modalities of temporal description of video
by presenting the two dimensions of time we define for a video document.

3.2 Two Dimensions of Time for Video 

The notion of story creates a new time dimension perceived by the user, namely
the story time. The story time is the time dimension during which the story of
the video "takes place". This time dimension is different from the real time
dimension of the video, called the video time, during which the story is "shown".
Events are thus present both in the story and in the video, but their temporal
characteristics in the story and in the video are different. These differences
lie essentially in the time intervals of the events and in their temporal
relationships, as they happen in the story and as they are shown in the video.
For example, an event shown in the video during a few minutes may create the
illusion of its happening during one day in the story. Another example is two
concurrent events of the story which happen in different places. To show this,
the events are cut into a few smaller parts which are shown alternately. The
flashback is another example, where an event happening in the story before
another is shown after it in the video. We have presented a detailed description
of the temporal aspect of video and its modeling in 0 .
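The distinction between the two time dimensions can be illustrated with a small sketch. It is not part of the model definition: the class names `Interval` and `Event`, the function `is_flashback`, the use of seconds as the unit, and the sample events are all assumptions made for illustration. Each event carries one interval per time dimension, and a flashback is detected when the order of two events in story time is the reverse of their order in video time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    beg: float  # start of the interval (assumed unit: seconds)
    end: float  # end of the interval, same unit

    def before(self, other: "Interval") -> bool:
        """True if this interval ends no later than `other` begins."""
        return self.end <= other.beg


@dataclass(frozen=True)
class Event:
    name: str
    video: Interval  # when the event is shown (video time)
    story: Interval  # when the event takes place (story time)


def is_flashback(earlier_in_story: Event, later_in_story: Event) -> bool:
    """True if an event that happens first in the story is shown afterwards."""
    return (earlier_in_story.story.before(later_in_story.story)
            and later_in_story.video.before(earlier_in_story.video))


# Hypothetical film: the trial is shown first, then the crime that preceded it.
trial = Event("trial", video=Interval(0, 300), story=Interval(1000, 1300))
crime = Event("crime", video=Interval(300, 420), story=Interval(100, 220))

print(is_flashback(crime, trial))  # the crime precedes the trial in the story
print(is_flashback(trial, crime))
```

Keeping the two intervals as separate fields means a query can be evaluated against either dimension without touching the other.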

4 The Principles of Video Data Model 

Following the results of our study of time-based perception explained earlier, we
now describe the principles of the proposed video model. Before giving this
detailed description, and in order to give an intuitive view of the kinds of queries
that we expect the model to answer, we present here a few examples³.

Q1: The video portions in which we see Gary Cooper dancing with a woman.
Q2: The video portions in which we see Alfred Hitchcock for at least 1 minute.
Q3: The video portions in which we see a car falling down to a valley after a
car pursuit.
Q4: The video portions in which an explosion happens at the same time that
two persons are talking together.
Q5: The video portions in which we see soldiers the day after the war.

Considering the characteristics of the time-based perception of videos on the one
hand and the type of queries that are expected to be resolved on the other hand,
we present here the principles on which our model is based:

1. The representation of the event as the base unit of content description.
Examples of events in the above queries are "Gary Cooper dancing with a
woman" in Q1 and "Two persons talking together" in Q4.

³ For the sake of simplicity the examples given here are taken from film videos;
however, as the principles described just after will reveal, any other type of video
(medical, object tracking, etc.) may be represented using this approach.

2. The representation of the interior construction of events from concepts and
relations. For example the event in Q1 is constructed from the concept
instances "Gary Cooper" and "woman" and the relation "dancing" between
these two concepts.

3. The representation of the temporal relations between the events. For example
the relation "after" between the two events "a car falling down to a valley"
and "a car pursuit" in Q3. We recall from the definition of the event given
in section 3.1 that events are considered as temporal units of description, and
so we may have temporal relations only between events and not inside each
one. Another important point is that, as we aim to represent both the video
time and the story time, we need to represent both video and story temporal
relations between the events.

4. The attribution of story and video time intervals to the events. The video
time interval has the extra utility of determining the boundaries of the video
portion containing each represented event. We note here that, as described
in section 3.2, an event can be related to several video intervals. In this
case, we use the minimum surrounding interval of a set S of intervals,
namely MSI(S), defined as the interval that begins with the earliest beginning
of the intervals of S and ends with the latest ending of the intervals of S.
The video temporal relations are then based on the MSI of the intervals related
to the events.
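The minimum surrounding interval of principle 4 can be sketched in a few lines. The representation of an interval as a `(begin, end)` pair of numbers, the function name `msi`, and the sample values are assumptions made for illustration.

```python
from typing import Iterable, Tuple

Interval = Tuple[float, float]  # (begin, end), assumed unit: seconds


def msi(intervals: Iterable[Interval]) -> Interval:
    """Minimum surrounding interval: earliest begin and latest end of the set."""
    items = list(intervals)
    if not items:
        raise ValueError("MSI is undefined for an empty set of intervals")
    return (min(beg for beg, _ in items), max(end for _, end in items))


# Hypothetical event shown in three separate portions of the video.
parts = [(120.0, 150.0), (30.0, 60.0), (200.0, 215.0)]
print(msi(parts))
```

Note that the MSI also covers the gaps between the portions, so a temporal relation computed on it describes the event as a whole rather than any single portion.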

Besides the above principles, there are some other key points which give
the presented model a high power of expression. One of these is the
use of hierarchies of concepts and relations which describe the
specialization relation existing between the concepts (and relations). We have
described these hierarchies using the concept and relation lattices of the
conceptual graph formalism (see section 5.1). These hierarchies are used
especially during the matching phase of the video retrieval.

In order to describe the details of our video model with precision and to avoid
giving separate pieces of description, we prefer to describe the model directly
using conceptual graphs, which we find quite adequate as a relational knowledge
representation formalism and especially suited to support the logical basis of
the retrieval model. Amongst the most important advantages of using this
formalism, we name the following three:

1. There exists a strong relation between the conceptual graph formalism and 
the first order logic and consequently with the logical model of information 
retrieval 0. 

2. There exist algebraic operators that are in accordance with the logical inter- 
pretation of conceptual graphs. This leads to a strong theoretical validation 
of the formalism synonym to the strong validation theory m- 

3. The algebraic interpretation gives a basis to achieve query processing with
polynomial complexity when adequate preprocessing is performed m-

5 Conceptual Graphs

To facilitate the understanding of the model proposed in section 6, we first
present an introduction to the conceptual graphs formalism and then its
application to information retrieval.

5.1 An Introduction to the Conceptual Graphs Formalism 

Conceptual graphs are a knowledge representation formalism based on
linguistics, psychology and philosophy, defined by Sowa HH. A conceptual graph
represents information as a finite, connected, oriented, bipartite graph having
two types of nodes: concepts and relations. Concepts have a type (which
corresponds to a semantic class) and possibly a referent (which corresponds to
an instantiation to an individual of the class). There exist two categories of
referents: individual referents, each of which designates a particular
individual, and the generic referent, noted *, which designates any individual
referent conforming to the type of the concept. Conceptual relations specify the
relation which exists between the concepts of the graph. The relations are
identified by a type and they give a direction to the conceptual graph
containing them.

An example of a conceptual graph is given below. This graph represents the fact
"A young man is talking to Mary".

[Man: *] → (Talking) → [Woman: Mary]
         → (Has_Character) → [Character: Young]
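As a rough illustration of the structure just described, the graph for "A young man is talking to Mary" can be held as a list of (concept, relation, concept) edges. This is a minimal sketch, not an implementation of the full formalism: the class names `Concept` and `ConceptualGraph` and the method `linear_form` are assumptions, and two generic concepts of the same type are treated as the same node, which is a simplification.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Concept:
    type: str                       # concept type (semantic class)
    referent: Optional[str] = None  # None stands for the generic referent "*"

    def __str__(self) -> str:
        ref = self.referent if self.referent is not None else "*"
        return f"[{self.type}: {ref}]"


@dataclass
class ConceptualGraph:
    # Each edge is (source concept, relation type, target concept).
    edges: List[Tuple[Concept, str, Concept]] = field(default_factory=list)

    def add(self, src: Concept, relation: str, dst: Concept) -> None:
        self.edges.append((src, relation, dst))

    def linear_form(self) -> List[str]:
        """One line per relation, in a simple linear notation."""
        return [f"{s} -> ({r}) -> {d}" for s, r, d in self.edges]


man = Concept("Man")  # generic referent: some man
g = ConceptualGraph()
g.add(man, "Talking", Concept("Woman", "Mary"))
g.add(man, "Has_Character", Concept("Character", "Young"))

for line in g.linear_form():
    print(line)
```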

The formalism defines a knowledge base that contains a concept lattice and a
relation lattice. The concept lattice, $T_c$, is a set of concept types provided
with a partial ordering relation $\leq_c$. The relation lattice, $T_r$, is a set
of relation types provided with a partial ordering relation $\leq_r$. (The
partial ordering relations $\leq_c$ and $\leq_r$ represent the notion of
generalization/specialization.) The concept types in $T_c$ are bounded by
$Top_c$ and $Bottom_c$, and the relation types in $T_r$ by $Top_r$ and
$Bottom_r$.
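The specialization ordering on types can be sketched as a walk up a type hierarchy. The fragment below is hypothetical and deliberately simplified: it is a tree with a single parent per type and no Bottom element, whereas a lattice in general allows several parents. The names `PARENT` and `is_specialization` are assumptions.

```python
from typing import Dict, Optional

# Hypothetical fragment of a concept type hierarchy: child -> parent.
PARENT: Dict[str, Optional[str]] = {
    "Top": None,
    "Entity": "Top",
    "Person": "Entity",
    "Object": "Entity",
    "Man": "Person",
    "Woman": "Person",
}


def is_specialization(t: str, u: str) -> bool:
    """True if type t is equal to u or lies below u in the hierarchy."""
    node: Optional[str] = t
    while node is not None:
        if node == u:
            return True
        node = PARENT.get(node)  # unknown types simply have no parent
    return False


print(is_specialization("Man", "Entity"))
print(is_specialization("Entity", "Man"))
```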

In the conceptual graphs formalism, canonical graphs represent the possible
situations of the real world. These graphs express the valid combinations of
the concepts and relations. There exists a set of canonical graphs, namely the
canonical base, that are defined a priori and which express the elementary
semantic constraints of the represented domain. Other canonical graphs are
derived from the canonical base by the canonical operations of copy, join,
restriction, and simplification proposed in the conceptual graph formalism.

If a graph g2 is derived from a graph g1, then g2 is a specialization of g1:
$g_2 \leq g_1$.

5.2 Conceptual Graphs and Information Retrieval 

The important advantage of conceptual graphs, besides their flexibility and
expressive power, is especially their connection with the logical model of
information retrieval. This connection is due to the explicit relation of the
formalism with first order logic. Sowa m defined an operator $\Phi$, which
associates to each graph $u$ a formula $\Phi(u)$ expressed in first order logic.
This operator has the property of preserving an order on the graphs: if
$u \leq v$ ($u$ is a specialization of $v$) then the associated logical formulas
satisfy the following implication: $\Phi(u) \supset \Phi(v)$.

The conceptual graph formalism permits a translational semantics towards
first order logic: using the operator $\Phi$, conceptual graph expressions are
translated to first order logic expressions. The relationship between this
formalism and a retrieval process is as follows: a document indexed by a
conceptual graph $D$ is relevant to a query represented by the graph $Q$ if $D$
is a specialization of $Q$, i.e., $\Phi(D) \supset \Phi(Q)$. Such an implication
is established on conceptual graphs using the projection operation: if
$u \leq v$ then there exists a projection of $v$ on $u$. So, using the
projection operation of conceptual graphs, we can affirm that $D \leq Q$ and
hence $\Phi(D) \supset \Phi(Q)$.
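To make the role of projection in retrieval more concrete, the following is a naive backtracking sketch, not the algorithm of the formalism and not a polynomial one. It assumes a query projects into a document when every query concept can be mapped to a document concept of the same or a more specialized type (with the same referent if the query names one), and every query relation has a matching document relation between the mapped concepts. The hierarchy `PARENT`, the function names and the sample graphs are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Node = Tuple[str, Optional[str]]  # (type, referent); None is the generic "*"
Edge = Tuple[str, str, str]       # (source node id, relation type, target node id)

# Hypothetical type hierarchy shared by concepts and relations: child -> parent.
PARENT = {"Man": "Person", "Woman": "Person", "Person": "Entity",
          "Entity": None, "Talking": "Action", "Action": None}


def leq(t: Optional[str], u: str) -> bool:
    """True if type t equals u or is a specialization of u."""
    while t is not None:
        if t == u:
            return True
        t = PARENT.get(t)
    return False


def node_ok(q: Node, d: Node) -> bool:
    """A document concept may replace a query concept it specializes."""
    return leq(d[0], q[0]) and (q[1] is None or q[1] == d[1])


def project(q_nodes: Dict[str, Node], q_edges: List[Edge],
            d_nodes: Dict[str, Node], d_edges: List[Edge]
            ) -> Optional[Dict[str, str]]:
    """Return a mapping from query node ids to document node ids, or None."""
    order = list(q_nodes)

    def consistent(mapping: Dict[str, str]) -> bool:
        for a, r, b in q_edges:
            if a in mapping and b in mapping:
                if not any(s == mapping[a] and t == mapping[b] and leq(rel, r)
                           for s, rel, t in d_edges):
                    return False
        return True

    def extend(i: int, mapping: Dict[str, str]) -> Optional[Dict[str, str]]:
        if i == len(order):
            return dict(mapping)
        for d_id, d in d_nodes.items():
            if node_ok(q_nodes[order[i]], d):
                mapping[order[i]] = d_id
                if consistent(mapping):
                    found = extend(i + 1, mapping)
                    if found is not None:
                        return found
                del mapping[order[i]]
        return None

    return extend(0, {})


# Document: a man is talking to Mary. Query: a person is talking to Mary.
doc_nodes = {"m": ("Man", None), "w": ("Woman", "Mary")}
doc_edges = [("m", "Talking", "w")]
query_nodes = {"x": ("Person", None), "y": ("Person", "Mary")}
query_edges = [("x", "Talking", "y")]

print(project(query_nodes, query_edges, doc_nodes, doc_edges))
```

Under these assumptions the document is judged relevant exactly when `project` returns a mapping; a query asking for a woman talking to Mary would return `None` on the same document.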

6 Video Data Model Presented by Conceptual Graphs 

An information retrieval model is usually presented in three distinct parts: the
document model, the query model and the matching function. Using the conceptual
graph formalism, queries and documents are each represented as a graph, and
the matching function consists of the projection of the query graph into the
document graph. To describe the video model, we give in the following the details
of the document model, which are exactly the same for the query model.

To describe the document model, we present the canonical base $C$, the
concept lattice $T_c$ and the relation lattice $T_r$.

Each video document is composed of a set of events. The content of each
event is described by a graph. The initial graphs of $C$ are thus the following:

[Video] → (Is_Composed_Of) → [Event]
[Event] → (Has_Content) → [Content]

The referents of the concept Video are unique identifiers for the instances of the
Video concept type. Content is itself a graph whose construction, as explained
before, depends on the characteristics of each particular domain. To distinguish
clearly between the generic and specific aspects of the model, we present the
description of the Content graph by providing a separate specific canonical base
$C_{sp}$, specific concept lattice $T_{csp}$, and specific relation lattice
$T_{rsp}$.

As an example, we describe the construction of the content graph for film video
documents. In that case, among the most important elements of events we
may consider Persons, Objects, Actions and Locations. Persons and Objects are
grouped under a more general concept, Entity. Actions are performed by Entities
on other Entities. They may be simple or complex.

For example, holding is considered a simple action: someone holds something,
whereas taking is an action which may be complex: someone takes something
from someone else. The following graphs of the canonical base $C_{sp}$
represent the above descriptions.

[Entity] → (Simple_Action) → [Entity]
[Entity] → (Complex_Action) → [Entity]
                            → [Entity]

Fig. 1. An example of (a) specific concept lattice and (b) specific relation lattice.

The referents of the concept types Entity, Person and Object are unique
identifiers for the corresponding instances. Besides the unique identifier, each
entity may have a name, normally of string type. The following graph of
$C_{sp}$ represents the attribution of a Name to each Entity.

[Entity] → (Has_Name) → [Name]

In addition, entities are related to certain Locations, as represented by the
following two graphs of $C_{sp}$.

[Entity] → (Has_Location) → [Location]
[Location] → (Has_Name) → [Name]

The related $T_{csp}$ and $T_{rsp}$ are shown in figure 1. In these lattices,
further specializations of the concepts and relations may be added to include
more details on the information related to the domain.

Besides the content graph related to each event, there exists important
temporal information representing the time interval of the event and its temporal
relationships with other events. This information is part of the generic
characteristics of the model, as it does not depend on the particular domain:
whatever the elements inside the event are, the temporal characteristics of the
event unit are the same in different domains. In section 4 we explained that to
each event we assign a video and a story time interval and that the events are
related by video and story temporal relations. To represent the temporal
relations we use the Moulin relations [13], which are a refinement of the
well-known Allen relations ra-. In fact, Moulin proposes the representation of
all Allen relations by only two relations, Before and During, with the parameter
Lap bound to Before and the parameters DB and DE bound to During. The
attribution of negative, zero, or positive values to these parameters permits at
the same time the distinction of the Allen relations and the quantification of
these relations 0. The detailed description of the temporal modeling of video
using these relations is given in 0 .

This temporal information is represented by the following graphs in the
generic canonical base, $C$. We distinguish between the concepts and relations
used to represent the video time and the story time using the postfixes "_V"
and "_S" respectively.

[Event] → (Before_V) → [Event]
                     → [Lap_V]
[Event] → (During_V) → [Event]
                     → [DB_V]
                     → [DE_V]
[Event] → (Has_Beg_V) → [Time_V]
[Event] → (Has_End_V) → [Time_V]
[Event] → (Has_Duration_V) → [Duration]

[Event] → (Before_S) → [Event]
                     → [Lap_S]
[Event] → (During_S) → [Event]
                     → [DB_S]
                     → [DE_S]
[Event] → (Has_Beg_S) → [Time_S]
[Event] → (Has_End_S) → [Time_S]
[Event] → (Has_Duration_S) → [Duration]



The generic concept and relation lattices are presented in figures 2 and 3. In
the generic relation lattice, note that the existence of the relations During and
Before, which are the more generic relations of During_V, During_S and Before_V,
Before_S, allows queries which ignore the distinction between the two times and
only ask whether events are concurrent or not.

Defining an interval by its beginning and end points as Intvl = (Beg, End), the
Moulin relations are as follows:

Before(Intvl1, Intvl2, Lap) where Lap = End2 - Beg1
During(Intvl1, Intvl2, DB, DE) where DB = Beg1 - Beg2 and DE = End2 - End1

Fig. 3. Generic relation lattice.



We also recall that the instances of the concepts Time_V, Time_S, Lap_V,
Lap_S, DB_V, DB_S, DE_V and DE_S should represent a coding of the
exact video and story time-points. To be precise, we propose a BNF-like
notation for the representation of these two kinds of times. We represent the
story time-points by the notation [-][YYYY]:[MM]:[DD]:[HH]:[MM]:[SS]. For
example, to represent the time 10:00 PM, we use the string ":::22:00:". The video
time-points are represented using SMPTE timecodes in the string format
[HH]:[MM]:[SS]:[FF].
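A story time-point in this notation can be decoded by splitting on the colons and treating empty fields as unspecified. The sketch below is only one possible reading: the function name `parse_story_time`, the field names, and the choice to pad strings that have fewer than six fields with unspecified trailing fields are assumptions.

```python
from typing import Dict, Optional

FIELDS = ("year", "month", "day", "hour", "minute", "second")


def parse_story_time(text: str) -> Dict[str, Optional[int]]:
    """Decode [-][YYYY]:[MM]:[DD]:[HH]:[MM]:[SS]; empty fields become None."""
    parts = text.split(":")
    if len(parts) > len(FIELDS):
        raise ValueError(f"too many fields in {text!r}")
    # Assumption: missing trailing fields are treated as unspecified.
    parts += [""] * (len(FIELDS) - len(parts))
    return {name: (int(p) if p.strip() else None)
            for name, p in zip(FIELDS, parts)}


print(parse_story_time(":::22:00:"))       # 10:00 PM on an unspecified date
print(parse_story_time("-0044:03:15:::"))  # a date with a negative year
```

Leaving unspecified fields as `None` rather than zero keeps a partial time-point such as "10:00 PM" distinct from a fully specified one.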



7 Example 

Finally, in order to give an overall view of typical graphs (document or query),
we consider here an example from the news: an interview of 6 minutes, during
which a person called A is interviewed by a person called B. During this
interview we first see a one minute report on A in Japan in September 1998 and
then another one minute report on A in Africa in May 1999. As we see in the
following, the proposed model permits the description of 3 events. By using two
different dimensions of time, it is possible to represent the exact video
temporal features of these events (especially the concurrency of events 2 and 3
with event 1) and also the story temporal features (here the dates of the trips).

[Video:#v1] → (Has_Event) → [Event:#e1]
[Video:#v1] → (Has_Event) → [Event:#e2]
[Video:#v1] → (Has_Event) → [Event:#e3]

[Event:#e1] → (Has_Content) → [[Person:#b] → (Talk) → [Person:#a]]
[Person:#b] → (Has_Name) → [Name:"B"]
[Person:#a] → (Has_Name) → [Name:"A"]
[Event:#e1] → (Has_Duration_V) → [Duration:"00:00:06:00"]

[Event:#e2] → (Has_Content) → [[Person:#a] → (Has_Location) → [Location:#l1]]
[Location:#l1] → (Has_Name) → [Africa]
[Event:#e2] → (Has_Beg_S) → [Time_S:"1998:09::"]
[Event:#e2] → (Has_Duration_V) → [Duration:"00:00:01:00"]

[Event:#e3] → (Has_Content) → [[Person:#a] → (Has_Location) → [Location:#l2]]
[Location:#l2] → (Has_Name) → [Japan]
[Event:#e3] → (Has_Beg_S) → [Time_S:"1999:05::"]
[Event:#e3] → (Has_Duration_V) → [Duration:"00:00:01:00"]

[Event:#e2] → (During_V) → [Event:#e1]
                         → [DB_V:"00:00:00:30"]
                         → [DE_V:"00:00:04:30"]
[Event:#e3] → (During_V) → [Event:#e1]
                         → [DB_V:"00:00:02:00"]
                         → [DE_V:"00:00:03:00"]

This description makes it possible to find:

1. interviews including reports on the same person (using co-referents),
2. interviews including events that happened on a given date or at a given time,
3. interviews including reports on a person in a given place,
4. any combination of the above queries, etc.
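The During parameters of the example can be checked with a short calculation. The sketch assumes the definitions DB = Beg1 - Beg2 and DE = End2 - End1 for During(Intvl1, Intvl2, DB, DE), takes the interview as starting at video time 0 and lasting 6 minutes as stated in the prose, and reads the DB and DE values of the two reports as 30 s and 4 min 30 s, and 2 min and 3 min. Working in seconds and the function names are assumptions.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (begin, end) in seconds of video time


def during_params(inner: Interval, outer: Interval) -> Tuple[int, int]:
    """DB and DE of During(inner, outer): offsets from the outer bounds."""
    (beg1, end1), (beg2, end2) = inner, outer
    return (beg1 - beg2, end2 - end1)


def inner_from_params(outer: Interval, db: int, de: int) -> Interval:
    """Recover the inner interval from the outer one and (DB, DE)."""
    beg2, end2 = outer
    return (beg2 + db, end2 - de)


interview = (0, 360)                                 # event 1: 6 minutes
report_e2 = inner_from_params(interview, 30, 270)    # DB = 30 s, DE = 4 min 30 s
report_e3 = inner_from_params(interview, 120, 180)   # DB = 2 min, DE = 3 min

print(report_e2, report_e2[1] - report_e2[0])
print(report_e3, report_e3[1] - report_e3[0])
```

Under this reading both reports come out one minute long and lie inside the interview, which agrees with the durations attached to events 2 and 3.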

8 Conclusion 

This paper described a new approach to the modeling of the content information
of video documents, together with a retrieval scheme corresponding to the
proposed model. We have used conceptual graphs as the support to represent the
proposed model. Using this formalism facilitated the description of the semantic
structure we propose for video documents through the representation of the
legal bindings of concepts and relations and also the hierarchies describing the
relations of specialization/generalization existing between them. Using separate
canonical bases and concept and relation lattices, we were able to present the
model with the generic and specific aspects of video separated, thus offering a
more generic model. The representation of the temporal aspect of videos using
the same formalism is another important point of the proposed model.

The presented video model may be extended in several directions. At the query
level, studies should be done to determine the different modalities of temporal
description in natural language and their correspondence to the temporal
description using the well-known Allen relations. This study will permit the
definition of a simple and natural query interface and also of the principles of
processing such queries. To extend the matching function, we consider studying
the possibilities of involving temporal characteristics during the matching
process. This may provide us with useful measures permitting the determination
of different levels of relevance, replacing exact matching.

Finally, we consider continuing the current study from an indexing
point of view. The challenge in this direction will be the determination of a set of
physical characteristics permitting us to automate as much as possible the
process of deriving events and their temporal characteristics from the video
content.